LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631647
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S85
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 407-731-819-668-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Santa Barbara Corpus of Spoken American English Part I
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S85
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings
of natural speech from all over the United States, representing a wide variety of
people of different regional origins, ages, occupations, and ethnic and social backgrounds.
It reflects many ways that people use language in their lives: conversation, gossip,
arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom
lectures, political speeches, bedtime stories, sermons, weddings, and more. *Data*
Part I contains 14 speech files of 15 to 30 minutes each, from the Santa Barbara
Corpus of Spoken American English. Collected by: University of California, Santa Barbara
Center for the Study of Discourse, Director John W. Du Bois (UCSB), Associate Editors:
Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).
The Santa Barbara Corpus of Spoken American English is part of the International Corpus
of English (Charles W. Meyer, Director), representing the American Component. Each
speech file is accompanied by a transcript in which phrases are time stamped with
respect to the audio recording. Personal names, place names, phone numbers, etc.,
in the transcripts have been altered to preserve the anonymity of the speakers and
their acquaintances and the audio files have been filtered to make these portions
of the recordings unrecognizable.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Du Bois, John W.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chafe, Wallace L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyer, Charles
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thompson, Sandra A.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S85
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631728
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S86
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 786-335-176-662-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1998 HUB4 Broadcast News Evaluation English Test Material
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S86
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the evaluation test material used in the 1998 DARPA/NIST
Continuous Speech Recognition Broadcast News HUB4 English Benchmark Test administered
by the NIST Spoken Natural Language Processing Group and produced by the Linguistic
Data Consortium (LDC), catalog number LDC2000S86, ISBN 1-58563-172-8. *Data* The test
material is contained in two SPHERE-formatted waveform files. The file h4e_98_1.sph
(set1) contains 1.5 hours of Broadcast News excerpts from 1996. The file h4e_98_2.sph
(set2) contains 1.5 hours of Broadcast News excerpts from 1998. Each file should be
separately recognized per the HUB4 English Evaluation Specification.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S86
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631736
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S87
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 884-501-625-360-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE) Training Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S87
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Speech in Noisy Environments (SPINE) Training Audio
Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium
(DDVPC) by ARCON Corp. and produced by the Linguistic Data Consortium (LDC), catalog
number LDC2000S87, ISBN 1-58563-173-6. A companion corpus, Speech in Noisy Environments
(SPINE) Training Transcripts, was also produced by the Linguistic Data Consortium
(LDC), catalog number LDC2000T49, ISBN 1-58563-174-4. These corpora support the
2000 Speech in Noisy Environments (SPINE1) evaluation. The 2000 Speech in Noisy Environments
Evaluation (SPINE1) is a first attempt to assess the state of the art and practice
in speech recognition technology in noisy military environments and to exchange information
on innovative speech recognition technology in the context of fully implemented systems
that perform realistic tasks. It is intended to be of interest to all university,
industrial and commercial speech system developers working on the problem of robust
speech recognition. The evaluation is flexible, so participants can take part in a way
suited to their development needs and abilities. The SPINE1 evaluation
focuses on the task of transcribing speech produced in noisy environments with emphasis
on noisy military environments. The evaluation is designed to promote research progress
in this area, to provide the opportunity for participants to try out new ideas for
developing robust speech recognition systems that are of both scientific and practical
interest, and to measure the performance of this technology. More information on this
evaluation is available at SPINE1. This work was sponsored in part by National Science
Foundation Grant No. IIS-9982201. The transcripts for this release are available as
Speech in Noisy Environments (SPINE) Training Transcripts (LDC2000T49). *Data* The
evaluation task is to transcribe speech produced in noisy environments. The training
and test speech data to be used for this evaluation were generated by ARCON Corp.
for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions.
The speech data consists of conversations between two communicators working on a collaborative,
Battleship-like task in which they seek and shoot at targets (ARCON Communicability
Exercise, ACE). Participants may talk freely, but the total vocabulary used is fairly
limited. Each person is seated in a sound chamber in which a previously recorded military
background noise environment is accurately reproduced. The participants use handsets
and transmission channels that are resident to the particular environment. The training
data includes 10 of 20 available talker pairs with 14 five-minute conversations
per talker pair (about 720 minutes total), which include four noise scenarios.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S87
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631760
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S88
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 691-755-940-811-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1999 HUB4 Broadcast News Evaluation English Test Material
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S88
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the English evaluation test material used in the 1999 NIST
Broadcast News Transcription Evaluation administered by the NIST Spoken Natural Language
Processing Group and produced by the Linguistic Data Consortium, catalog number LDC2000S88,
ISBN 1-58563-176-0. *Data* The test material is contained in two SPHERE-formatted
waveform files. The file bn99en_1.sph (set1) contains 1.5 hours of Broadcast News
excerpts from last year's set2 epoch. The file bn99en_2.sph (set2) contains 1.5 hours
of Broadcast News excerpts from the summer of 1998. Each file should be separately
recognized per the Broadcast News English Evaluation Specification. Additional test
material for each set is also included. Test materials include evaluation map files
(bn99en_1.uem), automatically generated segmentation files (bn99en_1.seg), transcripts
from the evaluation (bn99en_1.utf) and the utf.dtd used to validate the transcripts,
reference STM files (bn99en_1.stm), and transcript orthography mapping files (en981118.glm).
For more complete information, see the 1998 HUB4 Website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S88
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631795
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S89
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 748-783-667-076-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Voice of America (VOA) Czech Broadcast News Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S89
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Voice of America (VOA) Czech Broadcast News Audio was developed by the Linguistic
Data Consortium (LDC). Corresponding transcripts are contained in Voice of America
(VOA) Czech Broadcast News Transcripts (LDC2000T53), the documentation for which is
included with this release. *Data* Between February 9 and May 28, 1999, LDC collected
approximately 30 hours of Czech broadcast audio from the Voice of America news service.
The 62 data files presented in this corpus represent the audio of the daily broadcasts
of 30-minute news programs. Due to technical limitations in the hardware at LDC that
was used to receive the VOA broadcasts via a satellite downlink, a number of files
contain brief portions where the audio signal was interrupted. These interruptions
typically yielded regions of complete silence that lasted less than two seconds and
were scattered sparsely throughout an affected audio file. Additional markup was provided
in the transcription texts to isolate the regions where these interruptions occurred.
The 62 audio files in this corpus are single-channel, 16 kHz, 16-bit linear SPHERE
files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S89
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631671
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S92
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 163-189-179-812-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT2 Careful Transcription Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S92
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TDT2 (Topic Detection and Tracking) Careful Transcription Audio was developed by the
Linguistic Data Consortium (LDC) and contains English broadcast news audio recordings
collected by LDC in 1998. Corresponding transcripts are available in TDT2 Careful
Transcription Text LDC2000T44. Topic Detection and Tracking refers to automatic techniques
for finding topically-related material in streams of data such as newswire and broadcast
news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous
sections (segmentation), detect the occurrence of new events (detection) and track
the reoccurrence of old or new events (tracking). *Data* This publication contains
1998 broadcasts from the following sources: ABC News, Cable News Network, Public Radio
International and Voice of America.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S92
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631884
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000S96
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 940-433-236-519-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE) Evaluation Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000S96
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Speech in Noisy Environments (SPINE) Evaluation Audio
Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium
(DDVPC) by ARCON Corp. and produced by the Linguistic Data Consortium (LDC), catalog
number LDC2000S96, ISBN 1-58563-188-4. A companion corpus, Speech in Noisy Environments
(SPINE) Audio Transcripts, was also produced by the Linguistic Data Consortium (LDC),
catalog number LDC2000T54, ISBN 1-58563-189-2. These corpora support the 2000 Speech
in Noisy Environments (SPINE1) evaluation. There are a total of 120 files, one conversation
each, for a rough total of nine hours and 22 minutes (2.2 Gigabytes) of audio data.
An example of a corresponding transcript is available from the Speech in Noisy Environments
(SPINE) Evaluation Transcripts Corpus. Due to size and format considerations,
no example of a speech file is provided. The 2000 Speech in Noisy Environments Evaluation
(SPINE1) is a first attempt to assess the state of the art and practice in speech
recognition technology in noisy military environments and to exchange information
on innovative speech recognition technology in the context of fully implemented systems
that perform realistic tasks. It is intended to be of interest to all university,
industrial and commercial speech system developers working on the problem of robust
speech recognition. The evaluation is flexible, so participants can take part in a way
suited to their development needs and abilities. The SPINE1 evaluation
focuses on the task of transcribing speech produced in noisy environments with emphasis
on noisy military environments. The evaluation is designed to promote research progress
in this area, to provide the opportunity for participants to try out new ideas for
developing robust speech recognition systems that are of both scientific and practical
interest, and to measure the performance of this technology. More information on this
evaluation is available at SPINE1. This work was sponsored in part by National Science
Foundation Grant No. IIS-9982201. *Data* The evaluation task is to transcribe speech
produced in noisy environments. The training and test speech data to be used for this
evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium
(DDVPC) under controlled conditions. The speech data consists of conversations between
two communicators working on a collaborative, Battleship-like task in which they seek
and shoot at targets (ARCON Communicability Exercise, ACE). Participants may talk
freely, but the total vocabulary used is fairly limited. Each person is seated in
a sound chamber in which a previously recorded military background noise environment
is accurately reproduced. The participants use handsets and transmission channels
that are resident to the particular environment. The evaluation data includes 20 talker-pairs,
with six five-minute conversations per talker-pair (about 600 minutes total), from
a set of four scenarios.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000S96
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631655
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T43
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 233-420-716-637-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BLLIP 1987-89 WSJ Corpus Release 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T43
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release
1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version
of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately
30 million words. The annotation was performed using statistically-based methods developed
by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale
and Mark Johnson. This corpus both overlaps and supplements the million-word Penn
Treebank (PTB) collection of parsed and POS-tagged WSJ texts. *Data* The PTB project
selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic
annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3
(LDC99T42), both of which include the raw text for each story.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charniak, Eugene
ADDED ENTRY--PERSONAL NAME
- Personal name:
Blaheta, Don
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ge, Niyu
- Personal name:
Hall, Keith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hale, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Johnson, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T43
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631663
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T44
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 387-208-758-013-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT2 Careful Transcription Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T44
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TDT2 (Topic Detection and Tracking) Careful Transcription was developed by the Linguistic
Data Consortium (LDC) and contains transcripts of English broadcast news audio recordings
collected by LDC in 1998. The corresponding audio data is available in TDT2 Careful
Transcription Audio LDC2000S92. Topic Detection and Tracking refers to automatic techniques
for finding topically-related material in streams of data such as newswire and broadcast
news. This corpus was created to support three TDT2 tasks: to find topically homogeneous
sections (segmentation), to detect the occurrence of new events (detection) and to
track the reoccurrence of old or new events (tracking). *Data* The broadcast data
was collected from the following sources: ABC News, Cable News Network, Public Radio
International and Voice of America.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T44
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u kor d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T45
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 210-777-697-418-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T45
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus is a collection of Korean Press Agency news articles from June 2, 1994
to March 20, 2000. The collection includes articles from the date ranges listed below.
Not all dates in each interval are represented by files or articles:
1994  Jun. 2 to Dec. 31     87 files,  8.6 MB
1995  Jan. 1 to Dec. 31    179 files, 16.9 MB
1996  Jan. 1 to Mar. 29     83 files, 10.6 MB
1997  Jul. 28 to Dec. 31   245 files, 48.9 MB
1998  Jan. 2 to Dec. 31    285 files, 64.2 MB
1999  Jan. 3 to Dec. 31    216 files, 56.7 MB
2000  Jan. 3 to Mar. 20     56 files, 13.6 MB
Total                    1,151 files, 219.5 MB
*Data* The articles provided here have been collected by means
of a continuous feed from the news provider over a modem connection. Incoming data
from the modem was spooled directly to a "raw collection" file on a daily basis and
the raw files were then processed to produce the format for release by the LDC. There
are approximately 143,137 articles in this corpus. It is probable that there are duplicate
articles in this corpus. We have taken steps to remove articles that were corrupted
by failures or noise in modem transmission. The kinds of corruption that we were able
to eliminate include truncated articles (a valid end-of-article sequence is not observed
before a valid start-of-article) and invalid character codes within the text segment
of articles. Some corruption may have occurred that did not produce these symptoms
(e.g. service interruptions that might cause partial loss of data within or across
articles or corruptions that garble the content but happen not to produce any invalid
character codes). At present we have no means for detecting these more subtle problems
in the data, but we expect that they are relatively infrequent. The format chosen
for release consists of SGML tagging (since this gives a fairly simple and self-explanatory
presentation of the data) and the KSC-5601 Korean character encoding.
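The two corruption checks described above (an article that never reaches its end tag before the next one starts, and invalid character codes in the text) can be sketched roughly as follows. This is an illustration only, not LDC's actual processing: the `<DOC>` and `</DOC>` delimiters are hypothetical placeholders, since the record does not name the SGML tags, and the sketch assumes the KSC-5601 text is stored in its EUC-KR byte form, which Python exposes as the `euc_kr` codec.

```python
import re

# Hypothetical article delimiters: the record only says the release uses SGML
# tagging, so these tag names are placeholders, not the corpus's real markup.
START, END = b"<DOC>", b"</DOC>"


def split_articles(raw: bytes):
    """Return (complete, truncated) article bodies found in a raw byte stream.

    An article counts as truncated when the next start tag (or the end of
    the input) is reached before an end tag is seen.
    """
    complete, truncated = [], []
    starts = [m.start() for m in re.finditer(re.escape(START), raw)]
    for i, pos in enumerate(starts):
        # The article may extend at most to the next start tag.
        limit = starts[i + 1] if i + 1 < len(starts) else len(raw)
        body_start = pos + len(START)
        end = raw.find(END, body_start, limit)
        if end == -1:
            truncated.append(raw[body_start:limit])
        else:
            complete.append(raw[body_start:end])
    return complete, truncated


def has_valid_encoding(body: bytes, encoding: str = "euc_kr") -> bool:
    """True if the body decodes cleanly; assumes EUC-KR storage of KSC-5601."""
    try:
        body.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

As the summary notes, checks of this kind cannot catch corruption that leaves both the delimiters and the character codes valid.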
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Andy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T45
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631698
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T46
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 820-981-482-765-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hong Kong News Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T46
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hong Kong News Parallel Text was developed by the Linguistic Data Consortium (LDC)
and consists of parallel Chinese-English news articles from the Information Services
Department of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic
of China. LDC wishes to thank the Hong Kong Special Administrative Region of the People's
Republic of China for granting the LDC permission to distribute this data to the research
community. *Data* This corpus contains 18,147 aligned article pairs released by HKSAR
from July 1, 1997 to April 30, 2000. Automatic article alignment was done at the LDC.
The data directory contains 36,294 articles. Each article is a separate file, thus
there are 18,147 article pairs. The files are named using the convention yyyymmdd_nnn.[ce]
where:
* yyyy = year
* mm = month
* dd = date
* nnn = article date sequence number
* c = Cantonese, and e = English.
The example.c and example.e files contain a corresponding
sample news article from the corpus. The articles were collected by an automated system
from the internet. Incoming data was spooled directly to a raw collection file and
the raw files were then processed to produce the following format for release by the
LDC. Table.txt maps the Chinese files (*.c) to the corresponding English files (*.e).
The Chinese files are encoded in BIG5 with user-defined characters by HKSAR.
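The file naming convention described above can be parsed with a short helper. This is a minimal sketch: the names `ArticleFile` and `parse_name` are illustrative, and it assumes the sequence number is always exactly three digits, as the `nnn` placeholder suggests. Pairing Chinese and English files should go through Table.txt, whose layout is not described in this record, so the sketch does not attempt it.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# yyyymmdd_nnn.[ce]; three-digit sequence number is an assumption.
NAME_RE = re.compile(r"(\d{4})(\d{2})(\d{2})_(\d{3})\.([ce])")


class ArticleFile(NamedTuple):
    day: date   # release date taken from the file name
    seq: int    # article sequence number within that date
    lang: str   # "c" or "e", as in the naming convention


def parse_name(name: str) -> Optional[ArticleFile]:
    """Parse a corpus file name, or return None if it does not fit the pattern."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        return None
    year, month, day, seq, lang = m.groups()
    try:
        when = date(int(year), int(month), int(day))
    except ValueError:
        # Digits that do not form a real calendar date.
        return None
    return ArticleFile(when, int(seq), lang)
```

Names such as example.c do not follow the convention and come back as None.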
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T46
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631701
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T47
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 596-847-245-337-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hong Kong Laws Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T47
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hong Kong Laws Parallel Text was developed by the Linguistic Data Consortium (LDC)
and consists of processed and sentence-aligned Chinese-English documents from the
Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the
People's Republic of China. LDC wishes to thank the Hong Kong Special Administrative
Region of the People's Republic of China for granting the LDC permission to distribute
this data to the research community. *DATA* This corpus is organized into 19 parallel
file pairs for a total of 38 files. Each parallel file pair is named hklaws.nn.[ec]
where:
* nn = sequence number
* the file extensions are c = Cantonese and e = English
Each file holds up to 2,000 sequentially numbered sentences tagged with a sentence
index and sequence number as described below for a total of 37,807 sentence indices
across all 19 file pairs. The sentence numbering spans the file pairs such that the
initial sentence index (in files hklaws.01.e and hklaws.01.c) is 1, and the last sentence
index (in files hklaws.19.e and hklaws.19.c) is 37807. The sentence numbering establishes
the sentence parallelism: two sentences having the same index and sequence number are
purported to be parallel in content. Each sentence index may contain one or more sequentially
numbered sentences, with corresponding files in English and Chinese containing the
corresponding sets of sentences. The initial sequence number of each sentence is 1.
The sentence sequence number plus the sentence index number is sufficient to uniquely
identify parallel sentences. There are 313,659 sentences in the corpus. Each sentence
is of the form:...... ...... where # represents a one to five digit sentence index
or sequence number. Automatic sentence alignment was done at the LDC. The example.c
and example.e files contain sample corresponding Chinese and English Law files from
the corpus. The Chinese files are encoded in BIG5 with user-defined characters by
HKSAR. See http://www.info.gov.hk/gccs for details.
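Because the sentence index together with the sequence number identifies a parallel sentence, alignment reduces to a lookup on that pair. The sketch below is an illustration under stated assumptions: the sentence markup is elided in this record, so the functions work on already-parsed `(index, sequence, text)` tuples, and the names `file_pair` and `align` are hypothetical.

```python
from typing import Iterable, List, Tuple

Key = Tuple[int, int]            # (sentence index, sequence number)
Sentence = Tuple[int, int, str]  # (index, sequence, text), already parsed


def file_pair(nn: int) -> Tuple[str, str]:
    """Return the (English, Chinese) file names for sequence number nn."""
    if not 1 <= nn <= 19:
        raise ValueError("the corpus has 19 file pairs, numbered 01 to 19")
    return f"hklaws.{nn:02d}.e", f"hklaws.{nn:02d}.c"


def align(english: Iterable[Sentence],
          chinese: Iterable[Sentence]) -> List[Tuple[Key, str, str]]:
    """Pair sentences that share both sentence index and sequence number."""
    zh = {(idx, seq): text for idx, seq, text in chinese}
    pairs = []
    for idx, seq, text in english:
        key = (idx, seq)
        # Sentences with no counterpart on the other side are skipped.
        if key in zh:
            pairs.append((key, text, zh[key]))
    return pairs
```

Keeping the key as a tuple preserves the record's point that neither number alone is enough to identify a sentence.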
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T47
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631744
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T49
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 176-611-193-688-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE) Training Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T49
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Speech in Noisy Environments (SPINE) Training Transcripts,
created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC)
by Arcon Corp., and produced by the Linguistic Data Consortium (LDC), catalog number
LDC2000T49 and ISBN 1-58563-174-4. A companion corpus, Speech in Noisy Environments
(SPINE) Training Audio, was also produced by the Linguistic Data Consortium (LDC)
catalog number LDC2000S87, ISBN 1-58563-173-6. These corpora support the 2000 Speech
in Noisy Environments evaluation. The
2000 Speech in Noisy Environments Evaluation (SPINE1) is a first attempt to assess
the state of the art and practice in speech recognition technology in noisy military
environments and to exchange information on innovative speech recognition technology
in the context of fully implemented systems that perform realistic tasks. It is intended
to be of interest to all university, industrial and commercial speech system developers
working on the problem of robust speech recognition. The evaluation gives participants
the opportunity to participate in a flexible evaluation, suited to development needs
and abilities. The SPINE1 evaluation focuses on the task of transcribing speech produced
in noisy environments with the emphasis on speech produced in noisy military environments.
The evaluation is designed to promote research progress in this area, to provide the
opportunity for participants to try out new ideas for developing robust speech recognition
systems that are of both scientific and practical interest, and to measure the performance
of this technology. More information on this evaluation is available at SPINE1. This
work was sponsored in part by National Science Foundation Grant No. IIS-9982201. Corresponding
audio is available as Speech in Noisy Environments (SPINE) Training Audio (LDC2000S87).
*Data* The evaluation task is to transcribe speech produced in noisy environments.
The training and test speech data to be used for this evaluation were generated by
ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled
conditions. The speech data consists of conversations between two communicators working
on a collaborative, Battleship-like task in which they seek and shoot at targets (ARCON
Communicability Exercise, ACE). Participants may talk freely, but the total vocabulary
used is fairly limited. Each person is seated in a sound chamber in which a previously
recorded military background noise environment is accurately reproduced. The participants
use handsets and transmission channels that are resident to the particular environment.
The training data includes 10 of 20 available talker pairs with 14 five-minute conversations
per talker pair (about 720 minutes total) available, which include four noise scenarios.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rennert, Kara
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T49
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631752
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T50
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 272-276-125-586-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hong Kong Hansards Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T50
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hong Kong Hansards Parallel Text was developed by the Linguistic Data Consortium (LDC)
and contains excerpts from the Official Record of Proceedings of the Legislative Council
of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April
2000. LDC thanks the Hong Kong Special Administrative Region of the People's Republic
of China for granting permission to distribute this data to the research community.
The Legislative Council normally meets every Wednesday afternoon in the Chamber of
the Legislative Council Building. Business includes: discussion of subsidiary legislation,
papers, reports, addresses, statements, questions, the three readings of bills, motions
and debates. From time to time, the Chief Executive attends a special Council meeting
to brief Members on policy issues and to answer questions from Members. All Council
meetings are open to the public. The proceedings of the meetings are recorded verbatim
in the Official Record of Proceedings of the Legislative Council (Hansard). The record
of proceedings is in the original language delivered by the speakers (Floor Version).
The record is then translated into English and Chinese versions separately. *Data* This
corpus contains excerpts from the official record of meetings from October 1995 to
April 2000. There are 11.9 million English words and 18.15 million Chinese characters
in this release. Chinese text is presented in the traditional script and encoded as
BIG5. There are 388 files in the data/ subdirectory of this corpus, half (194 files)
in English in the data/english/ subdirectory and half (194 files) in Chinese in the
data/chinese/ subdirectory. Data file names are in the form YYYYMMDD_[ce].doc, where
YYYYMMDD indicates the date of the meeting, c=Chinese and e=English. As an example
of the text in this corpus, the Chinese sample is part of the Chinese-language record
of the meeting held on May 24, 1997. The parallel English file is in the English sample.
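Since each meeting date should appear once in English and once in Chinese, a quick consistency check is to group file names by date and list any date missing one language. This is a rough sketch; the helper names are illustrative, and it assumes file names have been collected from both the data/english/ and data/chinese/ subdirectories into one list.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# YYYYMMDD_[ce].doc, as given in the summary.
NAME_RE = re.compile(r"(\d{8})_([ce])\.doc")


def group_by_meeting(names: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each meeting date (YYYYMMDD) to the set of languages present."""
    seen = defaultdict(set)
    for name in names:
        m = NAME_RE.fullmatch(name)
        if m:  # ignore anything that does not follow the convention
            seen[m.group(1)].add(m.group(2))
    return dict(seen)


def unpaired(names: Iterable[str]) -> List[str]:
    """Return the meeting dates that lack either the c or the e file."""
    groups = group_by_meeting(names)
    return sorted(d for d, langs in groups.items() if langs != {"c", "e"})
```

An empty result from `unpaired` would be consistent with the 194 English and 194 Chinese files described above.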
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T50
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631779
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T51
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 445-901-162-731-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T51
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the TREC Spanish Corpus produced by the Linguistic Data
Consortium (LDC), catalog number LDC2000T51, ISBN 1-58563-177-9. This is the set of
documents used for the Spanish task in TRECs 3-5. It consists of approximately 250
megabytes of the Mexican newspaper El Norte and 300 megabytes of Agence France Presse
1994 newswire text formatted to include TREC document IDs. The El Norte documents
were used for TRECs 3-4 and the Agence France Presse documents were used for TREC
5. The topics (questions) and relevance judgments (right answers) that complete the
test collections can be downloaded from the TREC web site in the Data/Non-English
section. *Data* Please look at file.tbl for the directory structure of this publication,
as well as a complete list of files. The files in the afp_text and infosel_data subdirectories
are ASCII encoded SGML files that conform to the afp_trec.dtd and infosel.dtd files
found in the doc subdirectory.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rogers, Willie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T51
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631787
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T52
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 964-663-671-938-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T52
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the TREC ("Text REtrieval Conference") Mandarin Corpus used
for the Chinese task in TRECs 5-6 and consists of approximately 170 megabytes of articles
drawn from the People's Daily newspaper and the Xinhua newswire formatted to include
TREC document IDs. The text is Mandarin Chinese and is encoded using the GB encoding
scheme. The topics (questions) and relevance judgments (right answers) are not included
in this publication but can be downloaded from the Data/Non-English section of the
TREC web site. The Mandarin Chinese text data is from the Xinhua News Agency and the
People's Daily News Service (both from mainland China). This collection of text
was originally gathered by the Linguistic Data Consortium (LDC), and then adapted
by the National Institute of Standards and Technology (NIST) for use in the TREC Mandarin
evaluation program.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rogers, Willie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T52
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631809
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T53
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 152-783-757-211-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Voice of America (VOA) Czech Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T53
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Voice of America (VOA) Czech Broadcast News Transcripts was developed by the University
of West Bohemia. The transcripts in this release correspond to Voice of America (VOA)
Czech Broadcast News Audio (LDC2000S89). Support for this work was provided by the
Ministry of Education of the Czech Republic (Grant No. VS97159); by the Ministry of
Education of the Czech Republic (Project ME293); and by the NSF Language Engineering
Workshop at the Johns Hopkins University, Baltimore, MD USA (NSF Grant No. IIS-9820687).
*Data* Between February 9 and May 28, 1999, the Linguistic Data Consortium (LDC) collected
approximately 30 hours of Czech broadcast audio from the Voice of America news service.
The 62 data files presented in this corpus represent the transcripts of the daily
broadcasts of 30-minute news programs. The transcriptions were created by native Czech
speakers, Pavel Ircing, Jindrich Matousek, Ludek Muller, and Vlasta Radova, working
at the Department of Cybernetics, University of West Bohemia in Pilsen under the direction
of Josef Psutka. They used transcription software provided by LDC (the "Transcriber"
package), developed by Edouard Geoffrois and Claude Barras at DGA, France, with assistance
from Zhibiao Wu at LDC. The version of Transcriber used for this project produced
a text file format which is no longer supported by the software; also, the format
does not resemble any previous transcription format published by LDC. Therefore, the
files in this release have been converted into an SGML format that has been used for
other broadcast news transcription corpora, specifically, the "Universal Transcription
Format" (UTF -- not to be confused with the "Unicode Transformation Formats") defined
by the speech group at NIST (National Institute of Standards and Technology). A description
of that format is provided in the "utf.ps" (Postscript) and "utf.pdf" (Adobe Acrobat)
files, and the formal SGML definition is provided in "utf.dtd," all in the release
"doc" directory. The transcription text is rendered using the ISO 8859-2 character
set. Information relating this character set to the Unicode standard is available
at this site and from the Unicode Consortium. Due to technical limitations in the
hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink,
a number of files contain brief portions where the audio signal was interrupted. These
interruptions typically yielded regions of complete silence that lasted less than
two seconds and were scattered sparsely throughout an affected audio file. Additional
markup was provided in the transcription texts to isolate the regions where these
interruptions occurred. An example transcript is available as LDC2000T53.sample.
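Because the transcription text is rendered in ISO 8859-2, it usually needs to be decoded explicitly before use with Unicode-based tools. The following is a minimal sketch, assuming the transcript content has already been read as raw bytes; the function names and the sample bytes are hypothetical and are not part of the release.

```python
def latin2_to_text(data: bytes) -> str:
    """Decode ISO 8859-2 (Latin-2) bytes into a Python Unicode string."""
    return data.decode("iso8859_2")


def latin2_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO 8859-2 bytes as UTF-8 for tools that expect it."""
    return latin2_to_text(data).encode("utf-8")


# Hypothetical sample bytes: 0xE8 is "č" and 0xFD is "ý" in ISO 8859-2.
sample = b"\xe8esk\xfd"
print(latin2_to_text(sample))
```

Decoding with the wrong codec (for example ISO 8859-1) would not raise an error but would silently produce the wrong letters, so the codec name should be stated explicitly wherever the files are opened.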
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Radova, V.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muller, L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matousek, J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ircing, P.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T53
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2000 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631892
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2000T54
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 742-218-645-985-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE) Evaluation Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2000]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2000T54
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Speech in Noisy Environments (SPINE) Evaluation Transcripts,
created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC)
by Arcon Corp., and produced by the Linguistic Data Consortium (LDC) as catalog number
LDC2000T54 with ISBN 1-58563-189-2. A companion corpus, Speech in Noisy Environments
(SPINE) Evaluation Audio, was also produced by the Linguistic Data Consortium (LDC);
catalog number LDC2000S96, ISBN 1-58563-188-4. These corpora support the 2000 Speech
in Noisy Environments evaluation. For an example transcript, please click here. The
2000 Speech in Noisy Environments Evaluation (SPINE1) is a first attempt to assess
the state of the art and practice in speech recognition technology in noisy military
environments and to exchange information on innovative speech recognition technology
in the context of fully implemented systems that perform realistic tasks. It is intended
to be of interest to all university, industrial and commercial speech system developers
working on the problem of robust speech recognition. The evaluation gives participants
the opportunity to participate in a flexible evaluation, suited to development needs
and abilities. This work was sponsored in part by National Science Foundation Grant
No. IIS-9982201. *Data* The SPINE1 evaluation focuses on the task of transcribing
speech produced in noisy environments with the emphasis on speech produced in noisy
military environments. The evaluation is designed to promote research progress in
this area, to provide the opportunity for participants to try out new ideas for developing
robust speech recognition systems that are of both scientific and practical interest,
and to measure the performance of this technology. More information on this evaluation
is available at SPINE1. The evaluation task is to transcribe speech produced in noisy
environments. The training and test speech data to be used for this evaluation were
generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under
controlled conditions. The speech data consists of conversations between two communicators
working on a collaborative, Battleship-like task in which they seek and shoot at targets
(ARCON Communicability Exercise, ACE). Participants may talk freely, but the total
vocabulary used is fairly limited. Each person is seated in a sound chamber in which
a previously recorded military background noise environment is accurately reproduced.
The participants use handsets and transmission channels that are resident to the particular
environment. The evaluation data includes 20 talker-pairs, with six five-minute conversations
per talker-pair (about 600 minutes total), from a set of four scenarios
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rennert, Kara
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2000T54
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632066
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 275-012-111-000-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 1 Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as part of the training set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies. Also, the results
reported on will invite customer interest in the potential utility of the technologies.
More information on this evaluation is available here. This work was sponsored in
part by National Science Foundation Grant No. IIS-9982201. *Data* This publication
contains the Speech in Noisy Environments 2 (SPINE2) Clean and Vocoded Training Audio
Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium
(DDVPC) by Arcon Corp., and produced by the Linguistic Data Consortium (LDC) as catalog
number LDC2001S04 with ISBN 1-58563-206-6. The transcripts for this publication are
available as Speech in Noisy Environments (SPINE2) Training Transcripts LDC2001T05
with ISBN 1-58563-207-4. For an example transcript, please click here. These corpora
support the 2001 Speech in Noisy Environments evaluation. The training data comprises
two talker pairs (four speakers total) with 32 conversations (sessions) per talker
pair (64 conversations total). The audio for each session is presented in three forms:
* Unprocessed: the signal recorded at the participant's microphone * Bitstream: the
compressed "channel" data produced by the vocoder's analysis stage for transmission
from sender to receiver * Processed: the signal produced by the vocoder's synthesis
stage, given the bitstream data as input. There are a total of 64 clean audio files
and 64 vocoded files, one "game" each, for a rough total of seven hours of audio data,
1.6Gb (including the unprocessed, the processed, and the bitstream files), 20,850
total tokens (730 unique tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632082
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 912-790-987-794-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 2 Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as the development set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies. Also, the results
reported on will invite customer interest in the potential utility of the technologies.
More information on this evaluation is available here. This work was sponsored in
part by National Science Foundation Grant No. IIS-9982201. *Data* This publication
contains the Speech in Noisy Environments 2 (SPINE2) Clean and Vocoded Development
Audio Corpus created for the Department of Defense (DoD) Digital Voice Processing
Consortium (DDVPC) by Arcon Corp., and produced by the Linguistic Data Consortium
(LDC) as catalog number LDC2001S06 with ISBN 1-58563-208-2. The transcripts for this
publication are available as Speech in Noisy Environments (SPINE2) Development Transcripts
LDC2001T07 with ISBN 1-58563-209-0. For an example transcript, please click here.
These corpora support the 2001 Speech in Noisy Environments evaluation. The development
data comprises two talker pairs (four speakers total) with 16 conversations (sessions)
per talker pair (32 conversations total). The audio for each session is presented
in three forms: * Unprocessed: the signal recorded at the participant's microphone
* Bitstream: the compressed "channel" data produced by the vocoder's analysis stage
for transmission from sender to receiver * Processed: the signal produced by the vocoder's
synthesis stage, given the bitstream data as input. There are a total of 32 clean
audio files and 32 vocoded files, one "game" each, for a rough total of three and
a half hours (207 minutes) of audio data, 811Mb (including the unprocessed, the processed,
and the bitstream files), 9,700 total tokens (600 unique tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632104
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 546-593-184-777-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 3 Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as the evaluation set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies, and the results reported
on will invite customer interest in the potential utility of the technologies. More
information on this evaluation is available here. This work was sponsored in part
by National Science Foundation Grant No. IIS-9982201. *Data* This publication contains
the Speech in Noisy Environments 2 (SPINE2) Clean and Vocoded Evaluation Audio Corpus
created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC)
by Arcon Corp., and produced by the Linguistic Data Consortium (LDC) as catalog number
LDC2001S08 with ISBN 1-58563-210-4. The transcripts for this publication are available
as Speech in Noisy Environments (SPINE2) Evaluation Transcripts LDC2001T09 with ISBN
1-58563-211-2. For an example transcript, please click here. These corpora support
the 2001 Speech in Noisy Environments evaluation. The evaluation data comprises 16
talker pairs (32 speakers total) with four conversations (sessions) per talker pair
(64 conversations total). The audio for each session is presented in three forms:
* Unprocessed: the signal recorded at the participant's microphone * Bitstream: the
compressed "channel" data produced by the vocoder's analysis stage for transmission
from sender to receiver * Processed: the signal produced by the vocoder's synthesis
stage, given the bitstream data as input. There are a total of 64 clean audio files
and 64 vocoded files, one "game" each, for a rough total of seven hours (423 minutes)
of audio data, 1.6Gb (including the unprocessed, the processed, and the bitstream
files), 23,300 total tokens (930 unique tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632139
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 775-985-659-424-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard Cellular Part 1 Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Switchboard Cellular Part 1 Audio was developed by the Linguistic Data Consortium
(LDC) and consists of approximately 109 hours of English telephone conversations collected
by LDC between 1999 and 2000. The Switchboard cellular collection focused primarily on
GSM cellular phone technology. The project's goal was to target 190 subjects, balanced
by gender and under varied environmental conditions, each participating in ten or more
five- to six-minute conversations on GSM cellular phones. The speech data was collected for
research, development, and evaluation of automatic systems for speech-to-text conversion,
talker identification, language identification and speech signal detection purposes.
During the study period, LDC collected a total of 1,309 calls, or 2,618 sides (1,957
GSM), from 254 participants (129 male speakers, 125 female speakers) under varied
environmental conditions. *Data* This release contains speech data files with documentation
describing speaker information (sex, age, education, city and state where raised),
call information (date, time, call duration, Personal Identification Numbers, topic)
and audit information (channel quality, background noise). The data files are not
compressed. The documentation also contains reports on clipped files. Each speech
file consists of a 1,024-byte ASCII-formatted Sphere header, followed by two-channel
interleaved mu-law sample data. The mu-law samples represent the actual digital data
transmission from the telephone service provider (MCI), as captured separately for
each side of the telephone conversation by LDC's telephone collection platform. The
header also indicates the caller_pin, callee_pin, topic_id, cellular service/handset
information and speaker demographic information. Other releases in this series include:
Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15) Switchboard Cellular Part
1 Transcription (LDC2001T14) Switchboard Cellular Part 2 Audio (LDC2004S07)
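The file layout described above (a 1,024-byte ASCII Sphere header followed by two-channel interleaved mu-law samples) can be read with a few lines of code. The following is a minimal sketch, assuming the conventional NIST SPHERE header layout ("NIST_1A", a header-size line, then "name -type value" lines up to "end_head") and one byte per mu-law sample; the function names are hypothetical, and the field types of entries such as caller_pin are not confirmed by this record.

```python
def parse_sphere_header(data: bytes):
    """Return (header_size, fields) parsed from the start of a SPHERE file."""
    first = data[:16].decode("ascii").split("\n")
    if first[0] != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(first[1].strip())  # 1024 for the files described here
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return header_size, fields


def split_channels(samples: bytes):
    """De-interleave two-channel, one-byte-per-sample data into two sides."""
    return samples[0::2], samples[1::2]
```

A typical use would read a whole file into memory, call parse_sphere_header on it, and pass the bytes after header_size to split_channels to obtain the two conversation sides separately.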
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632155
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 183-382-664-496-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard Cellular Part 1 Transcribed Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Switchboard Cellular Part 1 Transcribed Audio was developed by the Linguistic Data
Consortium (LDC) and consists of approximately 24 hours of English telephone conversations
collected by LDC between 1999 and 2000. This release contains the speech data files that
correspond to Switchboard Cellular Part 1 Transcription (LDC2001T14). The full set
of conversations (approximately 109 hours) from the Switchboard Part 1 study is available
in Switchboard Cellular Part 1 Audio (LDC2001S13). Switchboard Cellular Part 2 Audio
(LDC2004S07) contains approximately 200 hours of English telephone conversations collected
by LDC in the Switchboard Part 2 study. The Switchboard Part 1 cellular collection
focused primarily on GSM cellular phone technology. The project's goal was to target
190 subjects, balanced by gender and under varied environmental conditions, each participating
in ten or more five- to six-minute conversations on GSM cellular phones. The speech data
was collected for research, development, and evaluation of automatic systems for speech-to-text
conversion, talker identification, language identification and speech signal detection
purposes. *Data* Each speech file consists of a 1,024-byte ASCII-formatted Sphere
header, followed by two-channel interleaved mu-law sample data. The mu-law samples
represent the actual digital data transmission from the telephone service provider
(MCI), as captured separately for each side of the telephone conversation by LDC's
telephone collection platform. The header also indicates the caller_pin, callee_pin,
topic_id, cellular service/handset information and speaker demographic information.
The documentation also contains reports on clipped files.
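Mu-law samples are 8-bit companded values, so most analysis tools need them expanded to linear PCM first. The following is a minimal sketch of the standard G.711 mu-law expansion, assuming one byte per sample; the function names are hypothetical and nothing here is taken from the release's own tools.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed linear sample."""
    u = ~byte & 0xFF            # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07  # segment number
    mantissa = u & 0x0F         # position within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(samples: bytes) -> list:
    """Expand a run of mu-law bytes (one channel) to linear samples."""
    return [ulaw_to_linear(b) for b in samples]


# 0xFF encodes silence; 0x80 and 0x00 are the positive and negative extremes.
print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```

For two-channel interleaved files, the bytes for each side should be separated before decoding so that each resulting sample list corresponds to a single speaker.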
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632996
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 787-443-746-101-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
FORM1 Kinematic Gesture
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
FORM1 Kinematic Gesture was produced by Linguistic Data Consortium (LDC) catalog number
LDC2004V01 and ISBN 1-58563-299-6. FORM is a gesture annotation scheme designed to
capture the kinematic information in gesture from videos of speakers. This publication
is a detailed database of gesture-annotated videos stored in the Anvil and FORM file
formats. FORM encodes the "phonetics" of gesture by giving geometric descriptions
of location and movement of the right and left arms. Other kinematic information such
as effort and shape are also recorded. FORM gesture data has applications in statistical
natural language processing, gesture recognition and generation, information extraction
from video, and human-computer interaction. Please go to the FORM website for more
information. The FORM2 publication was released in 2003 by the LDC and encoded much
of the same data provided here using a more recent tag set. *Data* This publication
contains gesture annotations created using the FORM 1.0 tag set. The Anvil annotation
files used in their creation are also included, as are 29.5 minutes of the original
audio and video recordings excerpted from a lecture given by Brian MacWhinney on January
24, 2000 at Carnegie Mellon University. A second data set, with 5.5 minutes of Paul
Howard telling a story in conversation while being motion captured, is also supplied.
These video recordings were chosen because they are part of the NSF-funded TalkBank
project. There are a total of 69 data files: 21 movie (.mov) files, 24 Anvil (.anvil)
files, and 24 FORM (.form1) files. The movie files are in QuickTime format with the
following specs: size 360 x 240 pixels; compression H.261; video rate 29.97 fps; audio
rate 48 kHz; audio format 8-bit/16-bit stereo. Anvil files can be opened using the Anvil
video annotation tool, which is freely available from Michael Kipp. The .form file
format is an intermediate data format that contains only the FORM2 values from each
.anvil in a comma-delimited, frame-by-frame listing of the following form: frame,upper_arm_lift,forearm_orientation,handshape,wrist_up_down,wrist_side_side,effort,tension
*Sponsorship* This research was conducted using funding from the following grant sources:
ISLE - 9910603; NSF: TalkBank (via subcontract from Carnegie Mellon University) - BCS-998009
and BCS-9978056; NSF: Discourse and Gesture w/ Joshi, Liberman, and Martell - EIA98-09209.
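The comma-delimited, frame-by-frame listing described in the summary could be read roughly as follows. This is a sketch under stated assumptions: the column names come from the summary, but the absence of a header row, the integer frame index, the mapping of frames to seconds via the 29.97 fps video rate, and the sample values are all assumptions made for illustration.

```python
import csv
import io

# Column order as given in the summary above.
FIELDS = [
    "frame", "upper_arm_lift", "forearm_orientation", "handshape",
    "wrist_up_down", "wrist_side_side", "effort", "tension",
]
VIDEO_FPS = 29.97  # video rate stated for the movie files


def read_form_rows(stream):
    """Yield one dict per frame; assumes no header row in the file."""
    for row in csv.DictReader(stream, fieldnames=FIELDS):
        row["frame"] = int(row["frame"])             # assumed integer frame index
        row["seconds"] = row["frame"] / VIDEO_FPS    # assumed frame-to-time mapping
        yield row


# Invented sample values, not taken from the corpus.
SAMPLE = "0,a,b,c,d,e,f,g\n30,a2,b2,c2,d2,e2,f2,g2\n"

if __name__ == "__main__":
    for row in read_form_rows(io.StringIO(SAMPLE)):
        print(row["frame"], round(row["seconds"], 3), row["handshape"])
```

Annotation values are kept as strings because the summary does not say how they are encoded.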
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martell, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Osborn, Chris
ADDED ENTRY--PERSONAL NAME
- Personal name:
Britt, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Myers, Kari
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634018
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 647-896-139-023-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
afb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Gulf Arabic Conversational Telephone Speech, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Gulf Arabic Conversational Telephone Speech, Transcripts is a database containing
transcripts of 975 Gulf Arabic speakers taking part in spontaneous telephone conversations
in Colloquial Gulf Arabic. A total of 976 conversation sides are provided (one speaker
appears on two distinct calls). The average duration per side is about 5.7 minutes.
The data was collected and transcribed in 2004 by Appen Pty Ltd., Sydney, Australia.
Each transcript file is a tab-delimited flat table, where each line contains information
and text for a single contiguous utterance, presented via the following fields:
* beginning time stamp in seconds, in square brackets ("[5.7189]")
* ending time stamp in seconds, in square brackets
* channel/speaker-ID ("A:" or "B:")
* "consonant skeleton" orthography for the utterance, in UTF-8
* "diacritized" orthography for the utterance, in ASCII
The ASCII field is the Buckwalter transliteration of the fully "vowelized"
(pronunciation) form of the utterance. Within fields 4 and 5, word boundaries are
marked by space characters in the normal way, following common practices of Arabic
orthographic convention (e.g. all definite articles and many conjunctions and prepositions
are attached as prefixes to the following word). Transcript tokens enclosed in single
parentheses -- e.g. "(DHk)" -- represent annotation marks for non-speech events or
conditions, such as laughter, noise, etc. Multi-token strings within single parentheses
involve words in some other language (typically English) or some other Arabic dialect.
Double parentheses, either with or without tokens enclosed within them -- e.g. "(())",
"((word))" or "((word1 word2))" -- represent regions where the transcriber was unable
to tell for sure what was said. The "consonant skeleton" orthography is intended to
reflect common orthographic practice in written Arabic (i.e. Modern Standard Arabic
(MSA)), but without being bound strictly by the specific spellings of MSA words. That
is, there may be novel (dialect-specific) words and changes of consonant quality (hence
altered spelling) in words that are cognate between MSA and Gulf Arabic. The "vowelized"
orthography is restricted to a character set that allows words to be rendered coherently
in Arabic script (with all diacritics present as needed to represent short vowels,
etc), but is intended to reflect the perceived pronunciation of each token. As a result,
a given word (type), having multiple occurrences in the text with identical "skeletal"
spellings, may have multiple distinct "vowelized" spellings. In some cases, these
different spellings simply reflect pronunciation variants, while in other cases, they
represent distinct morphological forms (with distinct contextual meanings) where the
semantic differences are conveyed solely by the short vowels (i.e. the diacritics).
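The five-field transcript layout and the parenthesis conventions described above can be handled along these lines. This is a minimal sketch, not a tool shipped with the corpus: the class, function names, and regular expressions are hypothetical, and the sample line is invented for illustration rather than taken from the data.

```python
import re
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float     # field 1, seconds
    end: float       # field 2, seconds
    channel: str     # field 3, "A" or "B"
    skeleton: str    # field 4, UTF-8 "consonant skeleton"
    vowelized: str   # field 5, ASCII Buckwalter transliteration


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    start, end, chan, skel, vow = line.rstrip("\n").split("\t", 4)
    return Utterance(
        float(start.strip("[]")),
        float(end.strip("[]")),
        chan.rstrip(":"),
        skel,
        vow,
    )


# "((...))": regions the transcriber could not make out (may be empty).
UNCERTAIN = re.compile(r"\(\(([^()]*)\)\)")
# "(...)": single-parenthesis marks, excluding the double-parenthesis form.
MARK = re.compile(r"(?<!\()\(([^()]+)\)(?!\))")


def uncertain_regions(text: str) -> list:
    return [m.strip() for m in UNCERTAIN.findall(text)]


def marks(text: str) -> list:
    return MARK.findall(text)


# Invented example line; only "(DHk)" and "[5.7189]" echo the summary above.
SAMPLE = "[5.7189]\t[7.2040]\tA:\t\u0645\u0631\u062d\u0628\u0627 (DHk)\tmarHabA (DHk)"

if __name__ == "__main__":
    utt = parse_line(SAMPLE)
    print(utt.channel, utt.start, utt.end, marks(utt.vowelized))
```

Single-parenthesis matches cover both single-token annotation marks and multi-token foreign-language strings; telling them apart by token count is left to the caller.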
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Gulf Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u bai d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632163
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 147-689-240-962-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
bai
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jgo
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms was produced by Linguistic Data
Consortium (LDC) catalog number LDC2001S16 and ISBN 1-58563-216-3. Please see below
for information regarding its collection, processing, and contents. The data contains
tone paradigms of the language Ngomba, a Bamileke (Grassfields Bantu) language spoken
by some 63,000 people in the Western Province of Cameroon. Ngomba's tone system is
undescribed, but it has many similarities with the closely related Yémba language
(also known as Bamileke Dschang). *Data* This publication contains 755 audio files.
The files in rawdata are 21 extended audio and laryngograph recordings with ESPS xlabel
files; each of the raw sound files contains the complete recording of one of the
tenses. The files in paradigms are HTML indexes linked to 734 one- to three-second
audio clips in .wav format. Each HTML page lists 32 utterances, varying across subject,
verb, and object. Transcriptions are provided for the audio clips using the IPA-based
orthography, and using phonetic and tonological transcription systems.
Recorded: June 21, 1997, Recording Studio of SIL Cameroon, Yaoundé.
Digitized, Labelled and Segmented: 1997-1998, Phonetics Laboratory, University of Edinburgh.
Transcribed and Annotated: 1998-2001, LDC, University of Pennsylvania.
Sponsorship: SIL Cameroon; Economic and Social Research Council (UK) Grant R000235540;
National Science Foundation (US) Grant 9983258; National Science Foundation (US) TalkBank
Project Grant BCS-998009, KDI, SBE; Linguistic Data Consortium.
LANGUAGE NOTE
- Language note:
Content in Ngomba. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bird, Steven
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bell, John
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631825
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S91
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 639-420-515-411-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB4 Broadcast News Evaluation Non-English Test Material
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S91
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
1997 HUB4 Broadcast News Evaluation Non-English Test Material was developed by the
Linguistic Data Consortium. It contains the evaluation test material used in the 1997
DARPA/NIST Continuous Speech Recognition Broadcast News HUB4 Non-English Benchmark
Test administered by the NIST Spoken Natural Language Processing Group. *Data* The
test material is contained in two SPHERE-formatted waveform files. The file h4ne97sp.sph
(set1) contains one hour of Spanish broadcast news excerpts from 1997. The file h4ne97ma.sph
(set2) contains one hour of Mandarin broadcast news excerpts from 1997. Each file
should be separately recognized per the HUB4 Non-English Evaluation Specification.
Note: 1997 HUB4 English evaluation material is contained in 1997 HUB4 English Evaluation
Speech and Transcripts LDC2002S11.
LANGUAGE NOTE
- Language note:
Content in Spanish and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S91
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631841
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S93
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 692-669-833-783-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT2 Mandarin Audio Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S93
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Topic Detection and Tracking (TDT) 2 Mandarin Audio Corpus contains approximately
52 hours of recordings of broadcast news audio. The transcriptions to these recordings
are available in the Topic Detection and Tracking (TDT) 2 Multilanguage Text Version
4.0, LDC2001T57. Topic Detection and Tracking (TDT) refers to automatic techniques
for finding topically related material in streams of data such as newswire and broadcast
news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous
sections (segmentation), detect the occurrence of new events (detection), and track
the reoccurrence of old or new events (tracking). *Data* Please see file.tbl for
the directory structure of this publication, as well as a complete list of files.
The data files are recordings of Voice of America (VOA) news broadcasts. The data
were collected daily over a period of six months (February-June 1998). The audio files
in this corpus are single-channel, 16 kHz, 16-bit linear SPHERE files.
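As a rough check on what these audio parameters imply for storage, the uncompressed sample data can be estimated from the figures in the summary (about 52 hours, one channel, 16 kHz, 16-bit linear). The sketch below is only an estimate: the helper name is hypothetical, and it ignores SPHERE headers and any compression applied for distribution.

```python
def pcm_bytes(hours: float, sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Size of uncompressed linear PCM sample data, headers excluded."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(round(seconds * sample_rate_hz * bytes_per_sample * channels))


if __name__ == "__main__":
    size = pcm_bytes(52, 16_000, 16)
    print(f"{size} bytes, about {size / 1e9:.2f} GB")
```

At 32,000 bytes per second of audio, 52 hours works out to roughly 6 GB of sample data.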
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S93
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S94
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 268-739-703-464-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT3 English Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S94
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically
related material in streams of data such as newswire and broadcast news. The TDT3
corpus was created to support three TDT3 tasks: find topically homogeneous sections
(segmentation), detect the occurrence of new events (detection), and track the reoccurrence
of old or new events (tracking). The goal of Topic Detection and Tracking - Phase
3 (TDT3) is to create core technology to monitor multiple streams of news in multiple
languages and media (newswire, radio, television, web sites or some future combination
or innovation), segmenting the streams into individual stories, detecting new topics
and tracking all stories discussing them. In addition to the TDT2 tasks of segmentation,
detection and tracking, TDT3 adds the tasks of first story detection and story-link
detection. The goal of the latter is to detect links between stories that discuss
the same topic even though the topic has not been defined in advance. *Data* The TDT3
English Audio Corpus contains the audio (in compressed sphere format) of news broadcasts
collected daily from six news sources in American English, over a three-month collection
period (October - December 1998). The sources and amounts are as follows:
Sources                                                   Hours  CDs
--------------------------------------------------------------------
CNN_HDL  Cable News Network, "Headline News"              174.6   19
ABC_WNT  American Broadcasting Co., "World News Tonight"   38.6    5
NBC_NNW  National Broadcasting Co., "NBC Nightly News"     44.6    6
MNB_NBW  MS-NBC, "News with Brian Williams"                51.8    6
PRI_TWD  Public Radio International, "The World"           63.9    7
VOA_ENG  Voice of America, English news programs          102.2   12
Total                                                     475.7   55
The files in this publication are complete single-channel recordings of the
(30 or 60-minute) broadcasts listed above. Each one has been digitized at a sample
rate of 16 kHz using 16-bit samples, and compressed using the "shorten" algorithm.
(The audio CD-ROMs are grouped into subsets by broadcast source and the LDC will support
the option of purchasing one or more subsets, e.g. just the ABC data. We regret that
we cannot provide "customized" subsets.) Tools for decompression can be found here.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S94
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631868
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S95
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 553-735-328-637-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT3 Mandarin Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S95
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the TDT3 Broadcast News Mandarin Corpus (Audio), produced
by the Linguistic Data Consortium (LDC), catalog number LDC2001S95 and ISBN number
1-58563-186-8. The contents of this publication were recorded from various 60-minute,
twice daily Mandarin news programs from VOA, amounting to approximately 123 hours
of audio. The transcripts of these broadcasts will be published in the TDT3 Mandarin
Text and TDT3 Multilanguage Text Corpora. *Data* Topic Detection and Tracking (TDT)
refers to automatic techniques for finding topically related material in streams of
data such as newswire and broadcast news. The TDT3 corpus was created to support three
TDT3 tasks: find topically homogeneous sections (segmentation), detect the occurrence
of new events (detection), and track the reoccurrence of old or new events (tracking).
The goal of Topic Detection and Tracking - Phase 3 (TDT3) is to create core technology
to monitor multiple streams of news in multiple languages and media (newswire, radio,
television, web sites or some future combination or innovation), segmenting the streams
into individual stories, detecting new topics and tracking all stories discussing
them. In addition to the TDT-2 tasks of segmentation, detection and tracking, TDT3
adds the tasks of first story detection and story-link detection. The goal of the
latter is to detect links between stories that discuss the same topic even though
the topic has not been defined in advance. Please see file.tbl for the directory structure
of this publication, as well as a complete list of files. The data files are recordings
of Voice of America (VOA) news broadcasts. The data were collected daily over a period
of three months (October-December 1998). The audio files in this corpus are single
channel, 16 kHz, 16-bit linear SPHERE files.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S95
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631922
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S97
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 919-164-226-906-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2000 NIST Speaker Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S97
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the 2000 NIST Speaker Recognition Evaluation Corpus, Linguistic
Data Consortium (LDC) catalog number LDC2001S97 and ISBN 1-58563-192-2. The 2000 NIST
Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of text-independent
speaker recognition. To this end, the evaluation was designed to be simple, to focus
on core technology issues, to be fully supported, and to be accessible. *Data* This
publication consists of 10,328 single-channel SPHERE files encoded in 8-bit mu-law
containing a total of approximately 4.31 Gbytes of data covering 148.9 hours of conversational
telephone speech collected by LDC. Supporting documentation for this evaluation may
be found on the 2000 NIST Speaker Recognition Evaluation website. Please note that
there was an optional additional corpus in the original Evaluation. If you are interested
in this AHUMADA corpus, please contact Javier Ortega-Garcia of the Universidad Politecnica
de Madrid. Information on how to contact Dr. Ortega-Garcia is available at 2000 NIST
Resources.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S97
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632007
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001S99
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 688-490-070-931-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001S99
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Speech in Noisy Environments 1 (SPINE1) Coded Audio
Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium
(DDVPC) by Arcon Corp., and produced by the Linguistic Data Consortium (LDC) as catalog
number LDC2001S99 with ISBN 1-58563-200-7. The transcripts for this publication are
available as Speech in Noisy Environments (SPINE1) Training Transcripts LDC2000T49
and Speech in Noisy Environments (SPINE1) Evaluation Transcripts LDC2000T54. This
work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data*
There are a total of 253 files, one
"game" each, for a rough total of 19 hours and 28 minutes (~4.4Gb) of audio data.
This corpus will be used as part of the training set for the Second Speech in Noisy
Environments Evaluation (SPINE2). SPINE2 will provide a continuing forum for assessing
the state of the art and practice in speech recognition technology for noisy military
environments and for exchanging information on innovative speech recognition technology
in the context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies, and the results reported
on will invite customer interest in the potential utility of the technologies. More
information on this evaluation is available here.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soo, Kai Shun
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001S99
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632058
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 783-262-033-141-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Message Understanding Conference (MUC) 7
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Message Understanding Conference (MUC) 7 was produced by the Linguistic Data Consortium
(LDC) as catalog number LDC2001T02 with ISBN 1-58563-205-8. In the 1990s, the MUC evaluations
funded the development of metrics and statistical algorithms to support government
evaluations of emerging information extraction technologies. Additional information
from NIST can be found here. *Data* The following list shows the correspondence between
versions of the information extraction (IE) task definition and stages of the MUC-7
evaluation: * Version 4.1: training and dryrun * Version 4.2: formalrun * Version 5.1:
final. The dryrun and formalrun have different domains; the dryrun (and training) consists
of air crash scenarios and the formalrun consists of missile launch scenarios. The final
version chiefly updates the Template Relations portion of the guidelines. Normally, for
each scenario, two datasets are
provided: training and test. When the evaluation cycle begins, the label for the scenario
dataset is training. Then the corresponding test dataset for that same scenario is
used for the dryrun testing. For the formal run, a formal training set is given out
four weeks before the test answers are due. The formal test is given out one week
before the test answers are due. After the entire evaluation and meeting have been
held, final edits are made if necessary.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chinchor, Nancy
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632074
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 525-834-904-967-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 1 Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as part of the training set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies. Also, the results
reported on will invite customer interest in the potential utility of the technologies.
More information on this evaluation is available here. This work was sponsored in
part by National Science Foundation Grant No. IIS-9982201. *Data* This publication
contains the Speech in Noisy Environments 2 (SPINE2) Training Transcripts, created
for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by
Arcon Corp., and produced by the Linguistic Data Consortium (LDC) as catalog number
LDC2001T05 with ISBN 1-58563-207-4.
The audio for this publication is available as Speech in Noisy Environments (SPINE2)
Training Audio LDC2001S04, ISBN 1-58563-206-6. These corpora support the 2001 Speech
in Noisy Environments evaluation. The training data comprises two talker pairs (four
speakers total) with 32 conversations (sessions) per talker pair (64 conversations
total). The audio for each session is presented in three forms: * Unprocessed: the
signal recorded at the participant's microphone * Bitstream: the compressed "channel"
data produced by the vocoder's analysis stage for transmission from sender to receiver
* Processed: the signal produced by the vocoder's synthesis stage, given the bitstream
data as input. There are a total of 64 clean audio files and 64 vocoded files, one
"game" each, for a rough total of seven hours of audio data, 1.6Gb (including the
unprocessed, the processed, and the bitstream files), 20,850 total tokens (730 unique
tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632090
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 199-327-669-774-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 2 Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as the development set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies, and the results reported
on will invite customer interest in the potential utility of the technologies. More
information on this evaluation is available here. This work was sponsored in part
by National Science Foundation Grant No. IIS-9982201. *Data* This publication contains
the Speech in Noisy Environments 2 (SPINE2) Development Transcripts, created for the
Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp.,
and produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001T07
with ISBN 1-58563-209-0. The audio for
this publication is available as Speech in Noisy Environments (SPINE2) Development
Audio LDC2001S06, ISBN 1-58563-208-2. These corpora support the 2001 Speech in Noisy
Environments evaluation. The development data comprises two talker pairs (four speakers
total) with 16 conversations (sessions) per talker pair (32 conversations total).
The audio for each session is presented in three forms: * Unprocessed: the signal
recorded at the participant's microphone * Bitstream: the compressed "channel" data
produced by the vocoder's analysis stage for transmission from sender to receiver
* Processed: the signal produced by the vocoder's synthesis stage, given the bitstream
data as input. There are a total of 32 clean audio files and 32 vocoded files, one
"game" each, for a rough total of three and a half hours (207 minutes) of audio data,
811Mb (including the unprocessed, the processed, and the bitstream files), 9,700 total
tokens (600 unique tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632112
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 653-626-872-106-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech in Noisy Environments (SPINE2) Part 3 Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was used as the evaluation set for the Second Speech in Noisy Environments
Evaluation (SPINE2). SPINE2 provides a continuing forum for assessing the state of
the art and practice in speech recognition technology for noisy military environments
and for exchanging information on innovative speech recognition technology in the
context of fully implemented systems that perform realistic tasks. The evaluation
will provide researchers, potential sponsors, and customers with a quantitative means
to appreciate the strengths and weaknesses of the technologies, and the results reported
on will invite customer interest in the potential utility of the technologies. More
information on this evaluation is available here. This work was sponsored in part
by National Science Foundation Grant No. IIS-9982201. *Data* This publication contains
the Speech in Noisy Environments 2 (SPINE2) Evaluation Transcripts, created for the
Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp.,
and produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001T09
with ISBN 1-58563-211-2. The audio for
this publication is available as Speech in Noisy Environments (SPINE2) Evaluation Audio
LDC2001S08, ISBN 1-58563-210-4. These corpora support the 2001 Speech in Noisy Environments
evaluation. The evaluation data comprises 16 talker pairs (32 speakers total) with
four conversations (sessions) per talker pair (64 conversations total). The audio
for each session is presented in three forms: * Unprocessed: the signal recorded at
the participant's microphone * Bitstream: the compressed "channel" data produced by
the vocoder's analysis stage for transmission from sender to receiver * Processed:
the signal produced by the vocoder's synthesis stage, given the bitstream data as
input. There are a total of 64 clean audio files and 64 vocoded files, one "game"
each, for a rough total of seven hours (423 minutes) of audio data, 1.6Gb (including
the unprocessed, the processed, and the bitstream files), 23,300 total tokens (930
unique tokens).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidt-Nielsen, Astrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marsh, Elaine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tardelli, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gatewood, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreamer, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tremain, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tofan, Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632120
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 552-093-753-963-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Prague Dependency Treebank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Prague Dependency Treebank Version 1.0: * Morphologically and syntactically annotated
Czech data, 1.8MW * Czech-English parallel corpus, aligned, 0.9MW/1MW * Czech raw
texts (newspaper and journals), over 30MW * Czech NLP tools (morphology, tagging)
* General annotation tools (tree editors, tree viewer) (Abridged from part of the
paper: E. Hajičová, Dependency-Based Underlying-Structure Tagging of a Very Large
Czech Corpus.) Since a group of Czech linguists (Institute of Formal and Applied Linguistics,
Institute of Theoretical and Computational Linguistics) from Charles University in
Prague and Masaryk University in Brno first formulated the Czech National Corpus,
it has been quite clear to all of us that for the outcome of our project to have broader
relevance and multifaceted usage, we cannot confine ourselves to a mere compilation
of a very large corpus of Czech texts. We have been aware that in order to make the
corpus really useful for future users -- be they linguists or developers of natural
language processing systems of any kind -- we have to design annotation schemes and
develop tools that will allow us to add as much linguistic information as possible.
Having the advantage of a long and fruitful tradition of theoretical and computational
linguistics and inspired by the research resulting in the Penn Treebank, the project
group decided to build the Prague Dependency Treebank (PDT). *Data* The following
three points are characteristic of the theory underlying the PDT, fully visible at
the highest, tectogrammatical level: (i) Its theoretical background is a dependency-based
syntax (handling the sentence structure as concentrated around the verb and its valency,
but containing a further dimension, namely coordination). Among the reasons for the
choice of a dependency-based syntax, we primarily stress its relative economy and
perspicuous, immediate correspondence to the empirical data. (ii) The nodes of the
dependency tree (more precisely, of a multidimensional network) are labeled by complex
symbols consisting of lexical, morphological and syntactic parts. Thus, the label
of every node contains symbols expressing all of the information that is contained in
the grammatical position of the word and is relevant for a semantic (semantico-pragmatic)
interpretation. This makes the output representations, or the trees of our treebank,
useful not only for practical applications such as parsing, but also for inclusion
in an integrated theoretical description encompassing all layers from the outer
(phonetic or graphemic) shape of the sentence to its semantico-pragmatic representation,
be it in the form of truth-conditionally based intensional semantics or in that of
a framework paying more attention to the embedding of the sentence in context. (iii)
The dependency tree is understood as projective. Its relationships to the morphemic
representation of the sentence (a string of symbols, the order of which corresponds
to the surface word order) are handled by means of specific rules. Prague Dependency
Treebank as a project The Prague Dependency Treebank (PDT) is a long-term project
with two major phases. In the first phase (1996-2000), the morphological and syntactic
analytic layers of annotation were completed and released, together with a preview
of the tectogrammatical layer annotation, as PDT 1.0. During the second phase
(2000-2004, Center for Computational Linguistics), the tectogrammatical layer of
annotation will proceed, and PDT 2.0 will be available upon completion. The structure
of the Prague Dependency Treebank (PDT) is that of a three-layer annotated corpus
of Czech, taken as a representative of inflectionally rich, free word-order languages:
* Morphological layer (lowest) - Full morphological annotation * Analytic layer (middle)
- Superficial (surface) syntactic annotation using dependency treebank with a level
conceptually close to the syntactic annotation used in the Penn Treebank * Tectogrammatical
layer (highest) - Level of linguistic meaning Text Sources The electronic text sources
have been provided by the Institute of the Czech National Corpus. The text material
contains samples from the following sources: * Lidové Noviny (daily newspapers), 1991,
1994, 1995 * Mladá fronta Dnes (daily newspapers), 1992 * Českomoravský Profit (business
weekly), 1994 * Vesmír (scientific magazine), Academia Publishers, 1992, 1993. There
is also a parallel Czech-English corpus. Drawn from Reader's Digest 1993-1996, it consists
of 450 articles, 53,117 parallel sentences, 1,010,346 English tokens and 877,658 Czech
tokens. Inner format of PDT There are two internal formats employed in PDT: FS and
CSTS. The former is an older format, still heavily used by some treebank tools. The
latter, more general SGML-based encoding, is meant as the main PDT format (in the
future, it will be followed by an XML version, probably already for PDT 2.0). See
the description of the FS file format and documentation of the CSTS document type
definition (csts.dtd). Prague Dependency Treebank Version 1.0 PDT 0.5 (half through)
was released in 1998 and contains 456,705 tokens (words and punctuation) in 26,610
sentences. PDT 1.0 contains about three times more tokens and sentences than PDT 0.5.
It is completely manually annotated on the morphological and analytical levels and
includes a preview of tectogrammatically annotated data as well. Future The Prague
Dependency Treebank Version 2.0 will add the tectogrammatical layer of annotation
to PDT 1.0. It will be available with a reduced amount of data as preliminary Version
1.5 during 2002. The final data volume will be reached at the end of 2004. Support
The PDT 1.0 has been supported by the following grants and projects * Grant Agency
of the Czech Republic No. 405/96/0198 (Treebank Definition and Procedures Specification)
* Grant Agency of the Czech Republic No. 405/96/K214 (Tools and Morphological Layer
Annotation) * Ministry of Education of the Czech Republic No. VS96151 (Tools and Structural
Annotation on the Analytical Layer) * National Science Foundation No. IIS-9732388
(Version 0.5 Preparation for the Workshop 98) The PDT 2.0 will be supported by the
project * Ministry of Education of the Czech Republic No. LN00A063 (Center for Computational
Linguistics)
LANGUAGE NOTE
- Language note:
Content in Czech and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajičová, Eva
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pajas, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Panevová, Jarmila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sgall, Petr
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636487
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 460-638-744-650-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Proposition Bank 3.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Proposition Bank 3.0 is a continuation of the Chinese Proposition Bank project
which aims to create a corpus of text annotated with information about basic semantic
propositions. Chinese Proposition Bank 3.0 adds predicate-argument annotation on 187,731
words from Chinese Treebank 7.0 (LDC2010T07). The data sources comprise newswire,
magazine articles, various broadcast news and broadcast conversation programming,
web newsgroups and weblogs. LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23)
and Chinese Proposition Bank 2.0 (LDC2008T07). *Data* This release contains the predicate-argument
annotation of 173,206 verb instances and 14,525 noun instances. The annotation of
nouns is limited to nominalizations that have a corresponding verb. The general annotation
guidelines and the lexical guidelines (called frame files) for each verbal and nominal
predicate are also included in this release. Below are some statistics about the corpus.
* Total propositions for verbs - 173,206
* Total propositions for nouns - 14,525
* Total verbs framed - 24,642
* Total framesets - 26,467
* Verbs with multiple framesets - 1,337
* Average framesets per verb - 1.07
* Total nouns framed - 1,421
* Total noun framesets - 1,528
* Nouns with multiple framesets - 48
* Average framesets per noun - 1.08
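The figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and it assumes that "Total framesets" refers to verb framesets and that the 187,731 annotated words are the sum of the verb and noun instances.

```python
# Figures copied from the summary; variable names are illustrative only.
verb_props = 173_206       # total propositions for verbs
noun_props = 14_525        # total propositions for nouns
verbs_framed = 24_642
verb_framesets = 26_467    # assumption: "Total framesets" means verb framesets
nouns_framed = 1_421
noun_framesets = 1_528

# Assumption: the annotated word count is the sum of verb and noun instances.
total_annotated = verb_props + noun_props

# Average framesets per predicate, rounded to two decimals as in the summary.
avg_per_verb = round(verb_framesets / verbs_framed, 2)
avg_per_noun = round(noun_framesets / nouns_framed, 2)

print(total_annotated)
print(avg_per_verb, avg_per_noun)
```

Under these assumptions the sum and both averages agree with the numbers stated in the summary.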
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bai, Xiaopeng
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhang, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhong, Hua
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636495
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 405-302-338-505-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 was developed
by the Linguistic Data Consortium (LDC) and contains 115,826 tokens of word-aligned
Arabic and English parallel text with treebank annotations. This material was used
as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological and syntactic
structures aligned at the sentence level and the sub-sentence level. Such data sets
are useful for natural language processing and related fields, including automatic
word alignment system training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic
studies. With respect to machine translation system development, parallel aligned
treebanks may improve system performance with enhanced syntactic parsers, better rules
and knowledge about language pairs and reduced word error rate. In this release, the
source Arabic data was translated into English. Arabic and English treebank annotations
were performed independently. The parallel texts were then word aligned. The material
in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank
- Broadcast News v1.0 (LDC2012T07). *Data* The source data consists of Arabic broadcast
news programming collected by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai
TV. All data is encoded as UTF-8. A count of files, words, tokens and segments is
below.
Language   Files   Words    Tokens    Segments
Arabic     28      89,213   115,826   4,824
Note: Word count is based on the untokenized Arabic source. Token count is based on the
ATB-tokenized Arabic source. The purpose of the GALE word alignment task was to find correspondences
between words, phrases or groups of words in a set of parallel texts. Arabic-English
word alignment annotation consisted of the following tasks:
* Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
* Tagging unmatched words attached to other words or phrases
This release contains four types of files: raw, tokenized, treebank, and wa. The raw
format contains the original Arabic and
English sentences without any annotation. The tokenized format is the treebank tokenized
version of the raw data which may contain Empty Category tokens (treebank leaves that
have the POS label -NONE-). The treebank and wa files are treebank and word alignment
annotations on the tokenized files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 324-683-461-517-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Published by the Linguistic Data Consortium (LDC), catalog number LDC2001T11 and ISBN
1-58563-204-X. The Chinese Penn Treebank Project started in Summer 1998. The goal
is the creation of a 100,000 word corpus of Chinese with syntactic bracketing. More
information is available at The Chinese Treebank Project. Chinese Treebank 2.0 supersedes
and replaces the Chinese Penn Treebank Final Release (LDC2000T48 ISBN 1-58563-187-6).
*Data*
Size: About 100K words, 325 data files
Source: 325 articles from Xinhua newswire between 1994 and 1998
Coding: GB code
Format: Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", is retained in the data file.
Annotation: All files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (second pass).
SGML: All data files validate against chtb.dtd using nsgmls. The files
are located in the data subdirectory and are sequentially named as follows: chtb_nnn.fid
where nnn is the sequential file number. There is a cross reference in file.tbl which
provides some annotator and historical information. More extensive documentation,
including samples of the annotated data, can be found at http://www.cis.upenn.edu/~chinese.
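The file naming scheme described in this summary (chtb_nnn.fid) can be generated and parsed mechanically. The following is a minimal sketch, assuming that "nnn" is a zero-padded three-digit number; the helper names fid_name and fid_number are hypothetical and not part of the release.

```python
import re

# Assumption: "nnn" is a zero-padded, three-digit sequential file number.
FID_PATTERN = re.compile(r"^chtb_(\d{3})\.fid$")


def fid_name(n: int) -> str:
    """Build the data file name for sequential file number n."""
    return f"chtb_{n:03d}.fid"


def fid_number(name: str):
    """Return the sequential number encoded in a file name, or None if it does not match."""
    match = FID_PATTERN.match(name)
    return int(match.group(1)) if match else None


print(fid_name(7))
print(fid_number("chtb_325.fid"))
```

Names that do not follow the pattern, such as file.tbl, yield None rather than raising an error.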
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitch
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kroch, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632147
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 531-296-803-579-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard Cellular Part 1 Transcription
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Switchboard Cellular Part 1 Transcription was developed by the Linguistic Data Consortium
(LDC) and consists of transcripts of approximately 24 hours of English telephone conversations
collected by LDC between 1999 and 2000. The corresponding audio files are contained in
Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15). *Data* This release consists
of 250 talker pairs (250 speakers total) with one transcript (session) per talker
pair for a total of 250 conversations. The documentation included with this release
includes information on how calls were selected for transcription and on the specification
used to transcribe the audio files.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631906
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T55
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 013-368-610-633-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Newswire Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T55
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the Arabic Newswire A Corpus, Linguistic Data Consortium
(LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire Corpus
is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source
material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8).
The corpus includes articles from May 13, 1994 to December 20, 2000. *Data* The data
is in 2,337 compressed (zipped) Arabic text data files. There are 209 Mb of compressed
data (869 Mb uncompressed) with approximately 383,872 documents containing 76 million
tokens over approximately 666,094 unique words. A template of the tagging is presented
below. yyyymmdd_AFP_ARB.dddd Arabic Text Arabic TextOne or More Paragraphs of Arabic
Text Arabic Text Arabic Text
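The document identifier template given in this summary (yyyymmdd_AFP_ARB.dddd) encodes a date and a sequence number. A minimal sketch of parsing it follows, assuming that "dddd" is a four-digit sequence number; the function names and the example identifier are hypothetical.

```python
import re
from datetime import date, datetime

# Assumption: "dddd" is a four-digit sequence number within the day.
DOCNO_PATTERN = re.compile(r"^(\d{8})_AFP_ARB\.(\d{4})$")

# Coverage dates stated in the summary.
FIRST_DAY = date(1994, 5, 13)
LAST_DAY = date(2000, 12, 20)


def parse_docno(docno: str):
    """Split an identifier into (publication date, sequence number)."""
    match = DOCNO_PATTERN.match(docno)
    if match is None:
        raise ValueError(f"not a yyyymmdd_AFP_ARB.dddd identifier: {docno!r}")
    day = datetime.strptime(match.group(1), "%Y%m%d").date()
    return day, int(match.group(2))


def in_corpus_range(day: date) -> bool:
    """Check that a date falls inside the stated coverage period."""
    return FIRST_DAY <= day <= LAST_DAY


day, seq = parse_docno("19940513_AFP_ARB.0001")  # hypothetical identifier
print(day, seq, in_corpus_range(day))
```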
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T55
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631833
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T57
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 662-457-089-041-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT2 Multilanguage Text Version 4.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T57
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically
related material in streams of data such as newswire and broadcast news. The TDT2
corpus was created to support three TDT2 tasks: find topically homogeneous sections
(segmentation), detect the occurrence of new events (detection), and track the reoccurrence
of old or new events (tracking). *Data* TDT2 Multilanguage Text Corpus Version 4.0
contains news data collected daily from nine news sources in two languages (American
English and Mandarin Chinese), over a period of six months (January - June 1998).
Both manually-created reference text and automatically-generated text (ASR and/or
machine translation) are provided for all broadcast and all Mandarin data. This version
has been prepared to complement the first general release of the TDT3 Multilanguage
Text Corpus, providing new enhancements to make the data content more accessible to
a broader research community. The news sources and approximate number of stories per
source (in thousands) are as follows:
English sources (thousands of stories)
  New York Times Newswire Service                  11.8
  Associated Press Worldstream Service             12.8
  Cable News Network, Headline News                15.8
  American Broadcasting Co., World News Tonight     2.1
  Public Radio International, The World             2.9
  Voice of America (news programs)                  8.2
  Total English stories: 53.6 thousand
Mandarin sources (thousands of stories)
  Xinhua News Agency                               11.3
  Zaobao News Agency                                5.2
  Voice of America (news programs)                  2.3
  Total Mandarin stories: 18.8 thousand
This release consists of the English and Mandarin text components of the
TDT2 corpus. The data was collected daily over a period of six months (January-June
1998) from the following sources.
* American Broadcasting Company (ABC)
* Associated Press
* Cable News Network, Inc. (CNN)
* New York Times
* Public Radio International (PRI)
* Voice of America (VOA)
* Xinhua News Agency
* ZaoBao News
The data is provided in the following formats.
.sgm: Reference true-text, with markup providing story boundaries and descriptive information
.tkn: Tokenized version of SGML data, with all descriptive and boundary information removed
.as0: Output of the Dragon ASR system in tokenized form with information on timing, speaker clusters, and confidence
.as1: Output of the BBN ASR system in tokenized form with timing information (English Only)
.mttkn: SYSTRAN output from .tkn (Mandarin Only)
.mtas0: SYSTRAN output from .as0 (Mandarin Only)
The corpus also includes topic relevance tables as well as tables for locating
story boundaries.
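The file formats listed in this summary can be told apart by extension. The sketch below is illustrative only: the mapping restates the descriptions above, while the function name and the example file names are hypothetical and do not reflect the corpus's real naming scheme.

```python
from pathlib import Path

# Extension -> (description, language restriction or None), restated from the summary.
TDT2_FORMATS = {
    ".sgm": ("reference text with story-boundary markup", None),
    ".tkn": ("tokenized text, markup removed", None),
    ".as0": ("Dragon ASR output, tokenized", None),
    ".as1": ("BBN ASR output, tokenized", "English"),
    ".mttkn": ("SYSTRAN output from .tkn", "Mandarin"),
    ".mtas0": ("SYSTRAN output from .as0", "Mandarin"),
}


def describe(filename: str) -> str:
    """Return a one-line description of a file based on its extension."""
    suffix = Path(filename).suffix
    if suffix not in TDT2_FORMATS:
        return "unknown format"
    text, language = TDT2_FORMATS[suffix]
    return f"{text} ({language} only)" if language else text


# Hypothetical file names, used only to exercise the lookup.
for name in ("example_story.sgm", "example_story.mtas0", "notes.txt"):
    print(name, "->", describe(name))
```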
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wayne, Charles
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T57
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631930
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T58
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 567-570-426-915-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT3 Multilanguage Text Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T58
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically
related material in streams of data such as newswire and broadcast news. The TDT3
corpus was created to support three TDT3 tasks: to find topically homogeneous sections
(segmentation), to detect the occurrence of new events (detection) and to track the
reoccurrence of old or new events (tracking). *Data* TDT3 Multilanguage Text Corpus
Version 2.0 is the first general release of this collection (Version 1.0 was made
available only to participants in the TDT 1999 and 2000 evaluation tests). It contains
data from the same nine sources found in TDT2, plus two additional English television
sources. Like TDT2, it provides both manually-created and automatically-generated
text for most sources. For TDT3, the daily collection took place over a period of
three months (October - December 1998). The sources and approximate number of stories
per source are as follows:
English sources (thousands of stories)
  New York Times Newswire Service                     6.9
  Associated Press Worldstream Service                7.3
  Cable News Network, "Headline News"                 9.0
  American Broadcasting Co., "World News Tonight"     1.0
  Public Radio International, "The World"             1.6
  Voice of America, English news programs             3.9
  MS-NBC, "News with Brian Williams"                  0.7
  National Broadcasting Co., "NBC Nightly News"       0.8
  Total English stories: 31.2 thousand
Mandarin sources (thousands of stories)
  Xinhua News Agency                                  5.2
  Zaobao News Agency                                  3.8
  Voice of America, Mandarin Chinese news programs    3.8
  Total Mandarin stories: 12.8 thousand
The goal of Topic Detection and Tracking - Phase 3 (TDT3) is
to create core technology to monitor multiple streams of news in multiple languages
and media (newswire, radio, television, web sites or some future combination or innovation),
segmenting the streams into individual stories, detecting new topics and tracking
all stories discussing them. In addition to the TDT2 tasks of segmentation, detection
and tracking, TDT3 adds the tasks of first story detection and story-link detection.
The goal of the latter is to detect links between stories that discuss the same topic
even though the topic has not been defined in advance. There are two types of files
in this publication:
asr_sgm -- text data output from automatic speech recognition (ASR) systems in English and Mandarin, formatted in "TIPSTER-style" SGML, derived from the audio recordings of radio and TV broadcasts.
tkn_sgm -- reference text data (newswire, closed captions and manual transcripts), formatted in "TIPSTER-style" SGML.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T58
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631965
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T60
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 273-316-343-136-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Syllable-Final /s/ Lenition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T60
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication represents a study of lenition of syllable-final /s/ in Latin American
Spanish produced by the Linguistic Data Consortium (LDC). The data used in this study
came from three other LDC corpora, the CALLHOME Spanish Speech corpus, the CALLHOME
Spanish Transcripts, and the CALLHOME Spanish Lexicon. It is a well-known fact that
syllable-final /s/ is subject to lenition in many Latin American Spanish dialects.
Lenition of -/s/ is a variable phonological process in which an -/s/ may be aspirated
(pronounced [h]) or deleted altogether. Lenition of -/s/ has been widely studied by
sociolinguists, who have identified various linguistic and extralinguistic factors
that favor the process. Since syllable-final /s/ is frequent in Spanish, lenition
has a great effect on overall pronunciation. *Data* Please see file.tbl for the directory
structure of this publication, as well as a complete list of files. The primary data
file consists of data stored in the following fields:
* Token id
* Code
* Confidence level
* Speaker id
* Header of the line in the transcript
* Words from the transcript
* Location of word in the speaker's turn
* Location of /s/ in the word
* Preceding segment
* Following segment
* Word stress pattern
* Following word stress pattern
* Word start time
* Word end time
* Length of pause following word
* Coder
* Speaker's dialect
* Speaker's sex
* Speaker's age
* Corrected following word
* Comment
* Morphological information
There are on the order of 3,000 to 4,000 missing occurrences of syllable-final /s/
encodings. These omissions occur for two main reasons: changes in the transcriptions
after the list of all syllable-final /s/ occurrences was generated, and the failure of
some transcript lines to be automatically aligned. For a more detailed description
of this publication, see the researcher's description in HTML or Microsoft Word format.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fox, Michelle A.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T60
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631973
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T61
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 431-710-911-315-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Spanish Dialogue Act Annotation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T61
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Spanish Dialogue Act Annotation Corpus, Linguistic Data Consortium (LDC)
catalog number LDC2001T61 and ISBN 1-58563-197-3, was developed under Project CLARITY.
The goal of CLARITY was to glean discourse information from unrestricted conversational
speech using shallow, corpus-based analysis. The annotation was carried out at Interactive
Systems Labs at Carnegie Mellon University. *Data* This publication used a three-level
coding scheme to manually tag the LDC CALLHOME Spanish Transcripts. The three levels
of the coding scheme are: * a dialogue act level consisting of a tag set extended
from DAMSL and Switchboard; * a dialogue game level featuring short sequences of dialogue
acts; * a genre level similar to topical segments. All available (120) dialogues have
been annotated. Dialogue games are short sequences of dialogue acts such as question/answer
pairs. Genres can be storytelling, discussion, planning, etc. Segmentation takes topics
into account as well. Genres, games, and dialogue acts are annotated by type. Genres
are additionally annotated for activities and topics (on a 0-5 scale), for the central
object or person being discussed (who or what category), and contain a short synopsis
of the segment. An example of the tagging from one conversation
is presented below. <?xml version="1.0" encoding="iso-8859-1"?> Sí, eso es para eso,
de seguro. No importa. No importa. Bueno aquí, la Zaida está estudiando también en
la universidad con la Liana. Y qué estudia, mamá, qué están estudiando. [background
speech] Están estudiando Sociales. Ciencias Sociales. Ah, para maes- para maestra
de Sociales. Sí
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Waibel, Alex
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lavie, Alon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Levin, Lori
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ries, Klaus
ADDED ENTRY--PERSONAL NAME
- Personal name:
Valle-Argueta, Liza
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T61
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2001 pau u por d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632015
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2001T62
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 544-982-311-455-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2001]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2001T62
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CETEMPublico Version 1.7 (Corpus de Extractos de Textos Electronicos MCT/Publico),
produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001S04 with
ISBN 1-58563-206-6, is a corpus of newspaper texts from the Portuguese daily newspaper
Publico, compiled for purposes of research and development in natural language processing
(NLP) by the Computational Processing of Portuguese Project, under an agreement between
Publico and the Portuguese Ministry of Science and Technology (MCT). *Data* The corpus
includes the text of approximately 2,600 editions of Publico, produced between 1991
and 1998, and amounting to approximately 180 million words. CETEMPublico Version 1.7
contains 1,504,258 extracts (CETEMPublico Version 1.0 had 1,567,625). Version 1.7
was created in Oslo on August 6, 2001 and uses SGML tagging. The corpus is in 196
compressed text files, with names in the form cetemXXX.gz, from cetem001.gz to cetem196.gz.
This corpus was designed to assist researchers who develop computer programs processing
the Portuguese language and who would need raw material for their work. In addition,
the authors wished for the corpus to be useful to everyone who studies the Portuguese
language and wishes to verify their hypotheses in previously organized text material.
The online and the CQP versions are meant for such users, who are, in any case, also
welcome to get it on CD in order to process the corpus locally, possibly by means
of the corpus processing system of their choice. More detailed information is available
at http://www.linguateca.pt/cetempublico.
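The file-naming scheme described above (cetem001.gz through cetem196.gz) can be enumerated programmatically. The following is a minimal Python sketch, not part of the publication: the function names, the `corpus_dir` argument, and the default ISO-8859-1 text encoding are all assumptions made for illustration and should be checked against the corpus documentation.

```python
import gzip
from pathlib import Path


def cetem_filenames(first=1, last=196):
    """Return the archive names cetem001.gz ... cetem196.gz (range per the summary)."""
    return [f"cetem{n:03d}.gz" for n in range(first, last + 1)]


def iter_corpus_lines(corpus_dir, encoding="iso-8859-1"):
    """Yield text lines from every archive found under corpus_dir.

    corpus_dir is a hypothetical local copy of the corpus; the default
    encoding is an assumption, not something the catalog record states.
    Archives that are not present are skipped silently.
    """
    for name in cetem_filenames():
        path = Path(corpus_dir) / name
        if not path.exists():
            continue
        with gzip.open(path, "rt", encoding=encoding) as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Reading the archives one line at a time keeps memory use low, which may matter for a corpus of roughly 180 million words.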
LANGUAGE NOTE
- Language note:
Content in Portuguese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Santos, Diana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rocha, Paulo
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2001T62
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632384
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002L27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 108-782-445-016-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese-English Translation Lexicon Version 3.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002L27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2002 Chinese-English Translation Lexicon Version 3.0 was developed by the Linguistic
Data Consortium (LDC). In 1999, responding to urgent demand for a Chinese-English
bilingual wordlist to support various projects, LDC quickly solicited entries from
both in-house and Internet resources and compiled two versions of Chinese-English
wordlists, "ldc_ce_dict.1.0.gb" (henceforth Version 1) and "ldc_ce_dict.2.0.txt" (henceforth
Version 2). Version 1, with 24,298 entries, was relatively small and its coverage
was unbalanced. Version 2, created as an experiment, is impractical for translingual
information processing. Many of its entries were created by reversing source and target
language fields in various English-to-Chinese wordlists; as a result many entries
are not really words. The increasing demand for richer lexical resources led to the
present release, "ldc_cedict.gb.Version 3" (henceforth Version 3). *Data* What's New
in Version 3 The total number of Chinese headwords in this release is 54,170. In terms
of coverage, Version 3 is a superset of Version 1 and the LDC's Mandarin pronunciation
lexicon (Version 3/Version 4). The pronunciation lexicon has a total of 44,404 entries,
or 43,968 unique Chinese character strings (i.e. with pronunciation removed). There
are still 553 entries from the pronunciation lexicon not found in Version 3. We were
unable to provide accurate translations for these head words for various reasons:
they may be very technical; they don't make sense unless their source is re-examined;
they may have segmentation errors; or they may be rare words for which appropriate
translations could not be found due to limited time and resources. Version 3 also
left out fewer than 40 entries from Version 1. Most of these are rare single-character
words whose translations cannot be verified for accuracy. *Format* There is one data
file, the lexicon itself. Within the lexicon, each entry is in this format: head_word_in_Chinese_characters
/gloss 1/gloss 2/.../gloss n/ For example: 汉语 /Chinese language/Chinese/ 英文 /English
language/English/ (A Chinese-capable browser is needed to see this properly. You may
need to change your browser's character set to see Simplified Chinese characters.)
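The entry format described above (a head word followed by slash-delimited glosses) lends itself to a small parser. The sketch below is illustrative Python, not code shipped with the lexicon: the function name is hypothetical, and it assumes the head word contains no whitespace and is separated from the first slash by whitespace. The sample line is modeled on the format above; the file's on-disk character encoding is not handled here.

```python
import re

# Head word (no whitespace assumed), whitespace, then /gloss 1/gloss 2/.../gloss n/
ENTRY_PATTERN = re.compile(r"^(\S+)\s+/(.*)/\s*$")


def parse_entry(line):
    """Split one lexicon line into (head_word, [glosses]).

    Raises ValueError if the line does not follow the assumed layout.
    """
    match = ENTRY_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    head_word, gloss_field = match.groups()
    # Drop empty strings in case an entry contains doubled slashes.
    glosses = [gloss for gloss in gloss_field.split("/") if gloss]
    return head_word, glosses


if __name__ == "__main__":
    print(parse_entry("汉语 /Chinese language/Chinese/"))
```

Returning the glosses as a list keeps their original order, which may be useful if the first gloss is treated as the preferred translation (an assumption the record does not confirm).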
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002L27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632570
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002L49
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 435-186-167-011-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Buckwalter Arabic Morphological Analyzer Version 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002L49
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Buckwalter Arabic Morphological Analyzer Version 1.0 was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2002L49 and ISBN 1-58563-257-0. The Buckwalter
Arabic Morphological Analyzer is used for POS-tagging Arabic text. *Data* The data
consists primarily of three Arabic-English lexicon files: prefixes (299 entries),
suffixes (618 entries), and stems (82,158 entries representing 38,600 lemmas). The
lexicons are supplemented by three morphological compatibility tables used for controlling
prefix-stem combinations (1,648 entries), stem-suffix combinations (1,285 entries),
and prefix-suffix combinations (598 entries). The actual code for morphology analysis
and POS tagging is contained in a Perl script. The documentation consists of a readme
file with a description of the lexicon files, the morphological compatibility tables,
the morphology analysis algorithm, a summary of stem morphological categories, and
a table with the author's Arabic transliteration system.
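To illustrate how three lexicons and three compatibility tables of the kind described above could work together, here is a toy Python sketch. It is not the distributed Perl script and does not reproduce its algorithm or data: every entry, category label, and function name below is a hypothetical stand-in chosen for the example. The idea shown is to try every prefix/stem/suffix split of a word and keep only the splits whose category pairs appear in all three tables.

```python
# Toy lexicons: surface form -> list of morphological categories (all hypothetical).
PREFIXES = {"": ["Pref-0"], "wa": ["Pref-Wa"], "Al": ["NPref-Al"]}
STEMS = {"kitAb": ["N"], "katab": ["PV"]}
SUFFIXES = {"": ["Suff-0"], "a": ["PVSuff-a"]}

# Toy compatibility tables: sets of permitted category pairs (all hypothetical).
PREFIX_STEM = {("Pref-0", "N"), ("Pref-0", "PV"), ("Pref-Wa", "N"),
               ("Pref-Wa", "PV"), ("NPref-Al", "N")}
STEM_SUFFIX = {("N", "Suff-0"), ("PV", "PVSuff-a")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "PVSuff-a"),
                 ("Pref-Wa", "Suff-0"), ("Pref-Wa", "PVSuff-a"),
                 ("NPref-Al", "Suff-0")}


def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by all three tables."""
    analyses = []
    for i in range(len(word) + 1):             # end of prefix (may be empty)
        for j in range(i + 1, len(word) + 1):  # end of stem (stem is non-empty)
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for p_cat in PREFIXES.get(prefix, []):
                for s_cat in STEMS.get(stem, []):
                    for x_cat in SUFFIXES.get(suffix, []):
                        if ((p_cat, s_cat) in PREFIX_STEM
                                and (s_cat, x_cat) in STEM_SUFFIX
                                and (p_cat, x_cat) in PREFIX_SUFFIX):
                            analyses.append((prefix, stem, suffix))
    return analyses


if __name__ == "__main__":
    print(analyze("wakataba"))   # one licensed split in this toy data
    print(analyze("Alkataba"))   # rejected: the toy tables disallow NPref-Al + PV
```

Keeping the pairwise tables as sets makes each compatibility check a constant-time lookup, so the cost is dominated by the number of candidate splits.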
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002L49
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 223-969-897-944-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Arabic Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
West Point Arabic Speech was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2002S02 and ISBN 1-58563-199-x. West Point Arabic Speech contains speech
data that was collected and processed by members of the Department of Foreign Languages
at the United States Military Academy at West Point and the Center For Technology
Enhanced Language Learning (CTELL) as part of an effort called "Project Santiago."
The original purpose of this corpus was to train acoustic models for automatic speech
recognition that could be used as an aid in teaching Arabic to West Point cadets.
*Data* The corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42
hours of speech data. Each speech file represents one person reciting one prompt from
one of four prompt scripts. The utterances were recorded using a Shure SM10A microphone
and a RANE Model MS1 pre-amplifier. The files were recorded as 16-bit PCM low-byte-first
("little-endian") raw audio files, with a sampling rate of 22.05 kHz. They were then
converted to NIST SPHERE format. Approximately 7,200 of the recordings are from native
informants and 1,200 files are from non-native informants. The following tables show
the breakdown of corpus content in terms of male, female, native and non-native speakers.
Number of speakers
              male    female    total
native:         41        34       75
non-native:     25        10       35
totals:         66        44      110

Hours of data
              male    female    total
native:        6.0       4.4     10.4
non-native:   0.74      0.28     1.02
totals:       6.74      4.68    11.42

Megabytes of data
              male    female    total
native:        918       667     1585
non-native:  111.9      42.8    154.7
totals:     1029.9     709.8   1739.7

Number of speech files
              male    female    total
native:       4107      3163     7270
non-native:    883       363     1246
totals:       4990      3526     8516

Some of the recording sessions include a handful of utterances that were cut short
due to pronunciation mistakes or unexpected interruptions (e.g. phones ringing or doors
slamming). These partial utterances have been retained in the waveform directories
and are distinguished from the full-sentence recordings by having a trailing "-u"
in the filename, before the extension (e.g. "s1_080-u.sph" instead of "s1_080.sph").
The above tables describe all data; both the complete and partial utterances are accounted
for. 168 of the 8,516 speech files are partial utterances, and the remaining 8,348
are complete.
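The trailing "-u" naming convention described above makes it straightforward to separate partial from complete utterances. The following Python sketch is illustrative only; the function names are hypothetical and the input is assumed to be a list of file names or paths ending in ".sph".

```python
from pathlib import PurePosixPath


def is_partial_utterance(filename):
    """True if the name ends in "-u" before the extension, e.g. "s1_080-u.sph"."""
    return PurePosixPath(filename).stem.endswith("-u")


def split_by_completeness(filenames):
    """Return (complete, partial) lists, preserving the input order."""
    complete, partial = [], []
    for name in filenames:
        (partial if is_partial_utterance(name) else complete).append(name)
    return complete, partial


if __name__ == "__main__":
    complete, partial = split_by_completeness(["s1_080.sph", "s1_080-u.sph"])
    print(complete, partial)
```

Applied to the full corpus, the two lists would be expected to contain 8,348 and 168 names respectively, per the counts given above.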
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
LaRocca, Stephen A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chouairi, Rajaa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 845-828-590-368-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Translanguage English Database (TED) Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Translanguage English Database (TED) Speech consists of recordings of presentations
made by native English and non-native English speakers at the Third European Conference
on Speech Communication and Technology, EUROSPEECH 1993 in Berlin, Germany. This is
a joint publication with the European Language Resources Association (ELRA) sponsored
in part by National Science Foundation Grant No. IIS-9982201. The data set is released
by ELRA as TED Translanguage English Database (ELRA-S0031). *Data* The audio recordings
contain 188 speakers presenting academic papers for approximately 15 minutes each.
Transcripts for 39 of the recordings are available in Translanguage English Database
(TED) Transcripts LDC2002T03 and in Translanguage English Database (TED) Transcripts
database ELRA-S0120.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mariani, J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schiel, F.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632228
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 603-855-311-336-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard-2 Phase III Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Switchboard-2 Phase III Audio corpus was produced by the Linguistic Data Consortium,
catalog number LDC2002S06 and ISBN 1-58563-222-8. This release contains speech
data files ONLY, along with documentation describing speaker information (sex, age,
education, city and state where raised), call information (date, time, call duration,
Personal Identification Numbers, topic), and audit information (channel quality, background
noise). The data files are not compressed. The Switchboard-2 Phase III collection
was focused primarily in the American South. The collection commenced on October 21,
1997 and was completed on January 1, 1998. The project's goal was to target native
speakers of English in the American South, balanced by gender, to participate in (10+)
five to six minute conversations on a variety of telephone (land line) handsets. *Data*
The speech data was collected for research, development, and evaluation of automatic
systems for speech-to-text conversion, talker identification, language identification
and speech signal detection purposes. During the collection period, the LDC collected
a total of 2,728 calls, or 5,456 sides, from 640 participants (292 Male, 348 Female),
under varied environmental conditions. Each speech file consists of a 1,024-byte ASCII-formatted
Sphere header, followed by two-channel interleaved mu-law sample data. The mu-law
samples represent the actual digital data transmission from the telephone service
provider (MCI), as captured separately for each side of the telephone conversation
by the LDC's telephone collection platform. The header also indicates the caller_pin,
callee_pin, and topic_id. The speech files are named according to the following pattern:
sw_NNNNN.sph where the five-digit string "NNNNN" represents the conversation-id; this
string is used to identify all speech files and to identify the calls in the associated
data base tables that provide information about the calls and participants (i.e. callstat.tbl,
master.tbl). Other documentation files available on the publication are:
* 0readme.1st: Field information for all database tables
* swb_callaudit.tbl: Audit results for each channel
* swb_callaudit.txt: Document describing audit table
* swb_callstats.tbl: Information about recorded calls
* swb_callstats.txt: Document describing callstats table
* swb_callsubjects.tbl: Demographic information
* swb_callsubjects.txt: Document describing callsubjects table
* topics.txt: List of proposed call topics
There are a total of 2,657 data files (approximately 222 hours of audio).
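As a rough illustration of the file layout described above, the following Python sketch reads the fields of a SPHERE header and checks a file name against the sw_NNNNN.sph pattern. It assumes the usual NIST SPHERE conventions (a first line of NIST_1A, a second line giving the header size, then "name -type value" lines terminated by end_head). The function names parse_sphere_header and conversation_id are hypothetical and are not part of the LDC release, and the header values in the example are invented.

```python
import re

def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII SPHERE header found in the first 1,024 bytes of `raw`."""
    lines = raw[:1024].decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":      # integer field
            fields[name] = int(value)
        elif type_code == "-r":    # real-valued field
            fields[name] = float(value)
        else:                      # "-sN": string of N characters
            fields[name] = value
    return fields

# File names follow sw_NNNNN.sph, where NNNNN is the conversation-id.
SW_NAME = re.compile(r"^sw_(\d{5})\.sph$")

def conversation_id(filename: str):
    """Return the five-digit conversation-id, or None if the name does not match."""
    match = SW_NAME.match(filename)
    return match.group(1) if match else None

# Invented example header (values are assumptions), padded to 1,024 bytes.
example = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"caller_pin -s4 1234\n"
    b"end_head\n"
).ljust(1024, b" ")
header = parse_sphere_header(example)
```

On a real file, the first 1,024 bytes read in binary mode would be passed in place of the invented example, and the conversation-id returned by conversation_id could then be used to look up the call in the database tables.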
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632252
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 650-378-591-541-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2000 HUB5 English Evaluation Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2000 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC)
and consists of English conversational telephone speech used in the 2000 HUB5 evaluation
sponsored by NIST (National Institute of Standards and Technology). The Hub5 evaluation
series focused on conversational speech over the telephone with the particular task
of transcribing conversational speech into text. Its goals were to explore promising
new areas in the recognition of conversational speech, to develop advanced technology
incorporating those ideas and to measure the performance of new technology. Further
information about the evaluation can be found on the NIST HUB5 website. *Data* The
source data consists of conversational telephone speech collected by LDC: (1) 20 unreleased
telephone conversations from the Switchboard studies in which recruited speakers were
connected through a robot operator to carry on casual conversations about a daily
topic announced by the robot operator at the start of the call; and (2) 20 telephone
conversations from CALLHOME American English Speech which consists of unscripted telephone
conversations between native English speakers. The audio files are two-channel interleaved
mu-law in sphere format. The sphere headers have been modified from the original evaluation
data by the addition of sample checksums to the CALLHOME data files. A documentation
table contains information on the speech segments. Corresponding transcripts are available
in 2000 HUB5 English Evaluation Transcripts (LDC2003T43).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632260
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 320-759-350-893-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1998 HUB5 English Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
1998 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC)
and consists of English conversational telephone speech used in the 1998 HUB5 evaluation
sponsored by NIST (National Institute of Standards and Technology). The Hub5 evaluation
series focused on conversational speech over the telephone with the particular task
of transcribing conversational speech into text. Its goals were to explore promising
new areas in the recognition of conversational speech, to develop advanced technology
incorporating those ideas and to measure the performance of new technology. Further
information about the evaluation can be found on the NIST HUB5 website and in The
1998 HUB-5E Evaluation Plan for Recognition of Conversational Speech over the Telephone
in English, included in this release. *Data* The source data consists of conversational
telephone speech collected by LDC: (1) 20 telephone conversations from Switchboard-2
Phase 1 (LDC98S75) in which recruited speakers were connected through a robot operator
to carry on casual conversations about a daily topic announced by the robot operator
at the start of the call; and (2) 20 telephone conversations from CALLHOME American
English Speech which consists of unscripted telephone conversations between native
English speakers. The audio files are two channel interleaved mulaw in sphere format.
The sphere headers have been modified from the original evaluation data by the addition
of sample checksums to the CALLHOME data files. Corresponding transcripts are available
in 1998 HUB5 English Transcripts (LDC2003T02).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632279
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 484-643-355-341-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB4 English Evaluation Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB4 English Evaluation, Linguistic Data Consortium (LDC) catalog number
LDC2002S11 and ISBN 1-58563-227-9, is part of an ongoing series of periodic evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of conversational
speech recognition. To this end, the evaluation was designed to be simple, to focus
on core speech technology issues, to be fully supported, and to be accessible. The
purpose of this evaluation is to foster research on the problem of accurately transcribing
broadcast news speech and to measure objectively the state of the art. Additional
documentation is available at the 1997 NIST Evaluation Plan for Broadcast News webpage.
*Data* This year the entire test set is contained in a single waveform file. The SPHERE-formatted
waveform file h4e_97.sph is located in the h4e_evl directory of this CD-ROM. The
waveform file contains 334 Mbytes of sphere data, which represents approximately three
hours of concatenated radio and television broadcast news stories. The transcript
file contains roughly 30,600 total tokens and 4,800 unique tokens.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632287
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 002-640-961-995-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 HUB5 Mandarin Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2001 HUB5 Mandarin Evaluation is part of an ongoing series of periodic evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of conversational
speech recognition. To this end the evaluation was designed to be simple, to focus
on core speech technology issues, to be fully supported, and to be accessible. The
evaluation was held from February 21 to March 12, 2001. The systems were to produce
character-level transcripts and character-level confidence scores for the complete
set of evaluation test material. Additional information about the evaluation is available
from the NIST website. *Data* The test data comes from unexposed Mandarin CALLHOME
Conversations, stored in sphere format. There are 20 sphere files encoded in two-channel
interleaved mulaw for a total of 441,990,656 bytes (421 Mbytes) or eight hours of
sphere data. These conversations were transcribed and time-marked by speaker turn,
by the LDC. An included documentation table contains information on the speech segments
to be processed as follows: ...
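The byte count and duration quoted above can be cross-checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: an 8 kHz sampling rate and a 1,024-byte header per file are not given in this record and are borrowed from the descriptions of other LDC telephone releases, and the function name estimated_hours is hypothetical.

```python
def estimated_hours(total_bytes, n_files, header_bytes=1024,
                    channels=2, sample_rate=8000, bytes_per_sample=1):
    """Estimate conversation time in hours from the total size of mu-law files.

    Assumes one fixed-size header per file and one byte per mu-law sample;
    both channels cover the same stretch of time, so the payload is divided
    by the number of channels.
    """
    payload = total_bytes - n_files * header_bytes
    seconds = payload / (channels * sample_rate * bytes_per_sample)
    return seconds / 3600

total_bytes = 441_990_656        # figure quoted in the record
mbytes = total_bytes / 2**20     # "Mbytes" taken as 2**20 bytes
hours = estimated_hours(total_bytes, n_files=20)
print(f"{mbytes:.1f} Mbytes, about {hours:.1f} hours of conversation")
```

Under these assumptions the estimate comes out a little under eight hours, which is consistent with the rounded figure in the summary.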
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632295
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 158-193-057-195-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 HUB5 English Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2001 HUB5 English Evaluation was developed by the Linguistic Data Consortium and consists
of approximately 5 hours of English conversational telephone speech and associated
transcripts used in the 2001 HUB5 evaluation sponsored by NIST (National Institute
of Standards and Technology). The HUB5 evaluation series focused on conversational
speech recognition over the telephone with the particular task of transcribing
conversational speech into text. Its goals were to explore promising new areas in
the recognition of conversational speech, to develop advanced technology incorporating
those ideas and to measure the performance of the new technology. Further information
about the evaluation is contained in The 2001 NIST Evaluation Plan for Recognition
of Conversational Speech over the Telephone, included in this release. *Data* The
source data consists of conversational telephone speech collected between 1990 and 2000
under the Switchboard protocol, specifically, 20 conversations from each of Switchboard-1,
Release 2 (LDC97S62), Switchboard-2 Phase III Audio (LDC2002S06) and from the Switchboard
cellular phone collection, Switchboard Cellular Part 1 Audio (LDC2001S13) and Switchboard
Cellular Part 2 Audio (LDC2004S07). In the Switchboard study, recruited speakers were
connected through a robot operator to carry on casual conversations about a daily
topic announced by the robot operator at the start of the call. The audio files are
two-channel μlaw recordings in sphere format. The corresponding transcripts are presented
in stm format.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632325
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 359-804-609-199-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 Arabic Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 Arabic Evaluation was produced by the Linguistic Data Consortium (LDC);
catalog number LDC2002S22 and ISBN 1-58563-232-5. The 1997 HUB5 Non-English evaluation
is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations
provide an important contribution to the direction of research efforts and the calibration
of technical capabilities. They are intended to be of interest to all researchers
working on the general problem of conversational speech recognition. To this end the
evaluation was designed to be simple, to focus on core speech technology issues, to
be fully supported, and to be accessible. The HUB5 Non-English evaluation, conducted
in the fall of 1997, complemented another related evaluation which was conducted in
the spring. The spring evaluation focuses on the recognition of conversational speech
in English. This evaluation is dedicated to the advancement of speech recognition
technology for languages other than English; specifically for Arabic, German, Mandarin,
and Spanish. It focuses also on issues related to porting recognition technology to
new languages, to system generality, and to language commonalities and universals.
The HUB5 Non-English evaluation focuses on the task of transcribing conversational
speech into text. This task is posed in the context of conversational telephone speech
in Arabic, German, Mandarin, and Spanish. The evaluation is designed to foster research
progress, with the goals of:
* exploring promising new ideas in the recognition of conversational speech,
* developing advanced technology incorporating these ideas, and
* measuring the performance of this technology.
The task is to transcribe conversational speech. The speech to be transcribed is presented as a set of conversations collected
over the telephone. Each conversation is represented as a "4-wire" recording, that
is with two distinct sides, one from each end of the telephone circuit. Each side
is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit
mu-law encoding). Each conversation is represented as a sequence of "turns," where
each turn is the period of time when one speaker is speaking. Each successive turn
results from a reversal of speaking and listening roles for the conversation participants.
The transcription task is to produce the correct transcription for each of the specified
turns. The beginning and ending times of each of these turns will be supplied as side
information to the system under test. This turn information will be supplied in NDX
format, with one NDX file for all conversations to be transcribed. (Note that the
turns are not necessarily a simple sequence of non-overlapping time intervals. They
may be overlapping or non-alternating from time to time, because there is no sequencing
constraint on conversational interaction.) Additional documentation is available at
the 1997 NIST Evaluation Plan for Recognition of Conversational Speech Over the Telephone
website. *Data* This publication contains 20 sphere files encoded in two channel interleaved
mulaw with a sampling rate of 8 kHz, for a total of 424,160,000 bytes (405 Mbytes)
of sphere data. The sphere headers have been modified from the original Evaluation
data by the addition of sample checksums to the CALLHOME data files. An included documentation
table contains information on the speech segments to be processed as follows: ...
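The two-channel interleaved, 8-bit mu-law encoding described above can be unpacked roughly as follows. This is a minimal sketch, not the evaluation's own tooling: it uses the widely published G.711 mu-law expansion, assumes the sample bytes have already been read from after the sphere header, and the function names (mulaw_to_linear, split_channels, decode_side) are hypothetical.

```python
BIAS = 0x84  # bias constant used by the standard mu-law expansion

def mulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (16-bit range)."""
    u = ~byte & 0xFF                     # mu-law bytes are stored bit-inverted
    t = ((u & 0x0F) << 3) + BIAS         # mantissa plus bias
    t <<= (u & 0x70) >> 4                # scale by the segment (exponent)
    return BIAS - t if u & 0x80 else t - BIAS

def split_channels(samples: bytes):
    """De-interleave two-channel data: even bytes are one side, odd bytes the other."""
    return samples[0::2], samples[1::2]

def decode_side(samples: bytes):
    """Decode one side of the conversation to a list of linear samples."""
    return [mulaw_to_linear(b) for b in samples]

# Tiny invented example: four interleaved bytes, two per side.
interleaved = bytes([0xFF, 0x00, 0x80, 0x7F])
side_a, side_b = split_channels(interleaved)
pcm_a = decode_side(side_a)   # decoded from bytes 0xFF, 0x80
pcm_b = decode_side(side_b)   # decoded from bytes 0x00, 0x7F
```

Keeping the two sides separate after de-interleaving matches the "4-wire" description, in which each end of the telephone circuit is a distinct signal.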
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632333
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 972-929-534-164-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 English Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
1997 HUB5 English Evaluation was developed by the Linguistic Data Consortium (LDC)
and consists of English conversational telephone speech and associated transcripts
used in the 1997 HUB5 evaluation sponsored by NIST (National Institute of Standards
and Technology). The Hub5 evaluation series focused on conversational speech over
the telephone with the particular task of transcribing conversational speech into
text. Its goals were to explore promising new areas in the recognition of conversational
speech, to develop advanced technology incorporating those ideas and to measure the
performance of new technology. Further information about the evaluation can be found
on the NIST HUB5 website and in The 1997 HUB-5E Evaluation Plan for Recognition of
Conversational Speech over the Telephone in English, included in this release. *Data*
The source data consists of conversational telephone speech collected by LDC: (1)
20 telephone conversations from the Switchboard-2 studies (LDC98S75, LDC98S79) in
which recruited speakers were connected through a robot operator to carry on casual
conversations about a daily topic announced by the robot operator at the start of
the call; and (2) 20 telephone conversations from CALLHOME American English Speech
which consists of unscripted telephone conversations between native English speakers.
The audio files are in sphere format. The sphere headers have been modified from the
original evaluation data by the addition of sample checksums to the CALLHOME data
files. The corresponding transcripts are presented in text format.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632341
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 277-654-929-622-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 German Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 German Evaluation was produced by the Linguistic Data Consortium (LDC); catalog
number LDC2002S24 and ISBN 1-58563-234-1. The 1997 HUB5 Non-English Evaluation is
part of an ongoing series of periodic evaluations conducted by NIST. These evaluations
provide an important contribution to the direction of research efforts and the calibration
of technical capabilities. They are intended to be of interest to all researchers
working on the general problem of conversational speech recognition. To this end the
evaluation was designed to be simple, to focus on core speech technology issues, to
be fully supported, and to be accessible. The HUB5 Non-English Evaluation, conducted
in the fall of 1997, complemented another related evaluation which was conducted in
the spring of that year. The spring evaluation focuses on the recognition of conversational
speech in English. This evaluation is dedicated to the advancement of speech recognition
technology for languages other than English, and specifically this year for Arabic,
German, Mandarin, and Spanish. It focuses also on issues related to porting recognition
technology to new languages, to system generality, and to language commonalities and
universals. The HUB5 Non-English Evaluation focuses on the task of transcribing conversational
speech into text. This task is posed in the context of conversational telephone speech
in Arabic, German, Mandarin, and Spanish. The evaluation is designed to foster research
progress, with the goals of:
* exploring promising new ideas in the recognition of conversational speech
* developing advanced technology incorporating these ideas
* measuring the performance of this technology
The task is to transcribe conversational
speech. The speech to be transcribed is presented as a set of conversations collected
over the telephone. Each conversation is represented as a "4-wire" recording, that
is with two distinct sides, one from each end of the telephone circuit. Each side
is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit
u-law encoding). Additional documentation is available at the NIST website. *Data*
This publication contains 20 sphere files encoded in two-channel interleaved mu-law
with a sampling rate of 8 kHz, for a total of 561,150,160 bytes (535 Mbytes) or nine
hours of sphere data. An included documentation table contains information on the
speech segments to be processed as follows: ...
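The size and duration figures in this summary can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the byte total is almost entirely uncompressed audio (file header overhead is ignored) and that one Mbyte means 2**20 bytes. The function name `audio_stats` is hypothetical.

```python
def audio_stats(total_bytes, sample_rate_hz, channels, bytes_per_sample):
    """Return (mbytes, hours) for uncompressed audio of the given size."""
    mbytes = total_bytes / 2**20  # assumption: 1 Mbyte = 2**20 bytes
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    hours = total_bytes / bytes_per_second / 3600
    return mbytes, hours

# 8 kHz sampling, two interleaved channels, one byte (8-bit mu-law) per sample.
mb, hours = audio_stats(561_150_160, 8000, 2, 1)
print(f"{mb:.0f} Mbytes, {hours:.1f} hours")  # 535 Mbytes, 9.7 hours
```

Under these assumptions the result is in line with the stated 535 Mbytes and the approximate figure of nine hours.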
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 676-014-440-538-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 Spanish Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 Spanish Evaluation was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2002S25, ISBN 1-58563-235-X. The 1997 HUB5 Non-English Evaluation
is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations
provide an important contribution to the direction of research efforts and the calibration
of technical capabilities. They are intended to be of interest to all researchers
working on the general problem of conversational speech recognition. To this end the
evaluation was designed to be simple, to focus on core speech technology issues, to
be fully supported, and to be accessible. The HUB5 Non-English Evaluation focuses
on the task of transcribing conversational speech into text. This task is posed in
the context of conversational telephone speech. The evaluation is designed to foster
research progress, with the goals of:
* exploring promising new ideas in the recognition of conversational speech
* developing advanced technology incorporating these ideas
* measuring the performance of this technology
The task is to transcribe conversational
speech. The speech to be transcribed is presented as a set of conversations collected
over the telephone. Each conversation is represented as a "4-wire" recording, that
is with two distinct sides, one from each end of the telephone circuit. Each side
is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit
mu-law encoding). Additional documentation is available on the NIST website. *Data*
This publication contains 20 sphere files encoded in two-channel interleaved mu-law
with a sampling rate of 8 kHz, for a total of 447,201,280 bytes (426 Mbytes) or seven
hours of sphere data. An included documentation table contains information on the
speech segments to be processed as follows: ...
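The "4-wire" layout described in this summary (two sides, interleaved, 8-bit mu-law) can be illustrated with a short sketch. It assumes the raw audio payload has already been extracted from the file, with any header removed and any compression undone, and that the two sides alternate byte by byte. The decoding follows the standard G.711 mu-law expansion. The function names and the sample payload are hypothetical.

```python
def ulaw_to_pcm16(byte):
    """Decode one 8-bit mu-law byte to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def split_sides(interleaved):
    """Split two-channel interleaved mu-law bytes into two lists of samples."""
    side_a = [ulaw_to_pcm16(b) for b in interleaved[0::2]]
    side_b = [ulaw_to_pcm16(b) for b in interleaved[1::2]]
    return side_a, side_b

# Hypothetical payload: two frames, one byte per side per frame.
a, b = split_sides(bytes([0xFF, 0x00, 0x80, 0x7F]))
print(a, b)  # [0, 32124] [-32124, 0]
```

Which side of the call maps to which channel is not stated here, so the labels `side_a` and `side_b` are placeholders.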
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632376
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 191-383-337-125-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Emotional Prosody Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Emotional Prosody Speech and Transcripts was developed by the Linguistic Data Consortium
and contains audio recordings and corresponding transcripts, collected over an eight-month
period in 2000-2001 and designed to support research in emotional prosody. The
recordings consist of professional actors reading a series of semantically neutral
utterances (dates and numbers) spanning fourteen distinct emotional categories, selected
after Banse & Scherer's study of vocal emotional expression in German. (Banse, R. &
Scherer, K. R. 1996. Acoustic profiles in vocal emotion expression. Journal of Personality
and Social Psychology, 70, 614-636.) Actor participants were provided with descriptions
of each emotional context, including situational examples adapted from those used
in the original German study. Flashcards were used to display series of four-syllable
dates and numbers to be uttered in the appropriate emotional category. The Prosody
Recordings Project was interested in capturing the aspects of speech (emotion, intonation)
that are left out of the written form of a message. In these experiments, simple phrases
are expressed in ways that reflect varied contexts. The same phrase might be used
to answer different questions, address listeners at different distances from the speaker,
or express different emotional states. Actors were used because they are experts at
producing this kind of contextual variation in a natural and convincing way. *Data*
There are 30 data files: 15 recordings in sphere format and their transcripts. The
sphere files are encoded in two-channel interleaved 16-bit PCM, high-byte-first (big-endian)
format, for a total of 2,912,067,980 bytes (2777 Mbytes) or nine hours of sphere data.
The utterances were recorded directly into WAVES+ data files, on two channels with
a sampling rate of 22.05 kHz. The two microphones used were a stand-mounted boom Shure
SN94 and a headset Sennheiser HMD 410. The original session recordings are provided
in their entirety, including informal chit-chat and discussion between each emotion
category elicitation task. Time alignment is limited to utterances within the formal
elicitation tasks, and miscellaneous regions have been marked as such.
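The sample layout described in this summary (two-channel interleaved 16-bit PCM, high byte first) can be read with Python's standard `struct` module. The sketch below assumes the raw audio payload has already been separated from the file header. The function name and the sample bytes are hypothetical, and the mapping of microphones to channels is not stated in this record.

```python
import struct

def unpack_stereo_be16(payload):
    """Split interleaved big-endian 16-bit PCM into two lists of samples."""
    count = len(payload) // 2                      # number of 16-bit samples
    samples = struct.unpack(f">{count}h", payload[:count * 2])
    return list(samples[0::2]), list(samples[1::2])

# Hypothetical payload: two frames of (channel 1, channel 2) samples.
raw = bytes([0x00, 0x01, 0xFF, 0xFF, 0x7F, 0xFF, 0x80, 0x00])
ch1, ch2 = unpack_stereo_be16(raw)
print(ch1, ch2)  # [1, 32767] [-1, -32768]

# Rough duration check from the stated byte total, ignoring header overhead:
# 22,050 samples per second, two channels, two bytes per sample.
hours = 2_912_067_980 / (22_050 * 2 * 2) / 3600
print(f"{hours:.1f} hours")  # 9.2 hours
```

The duration estimate is consistent with the approximate figure of nine hours given in the summary.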
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Davis, Kelly
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grossman, Murray
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bell, John
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632414
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S34
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 176-469-534-264-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 NIST Speaker Recognition Evaluation Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S34
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 2001 NIST Speaker Recognition Evaluation Corpus was produced by the Linguistic
Data Consortium (LDC), catalog number LDC2002S34, ISBN 1-58563-241-4. The 2001 NIST
Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of text-independent
speaker recognition. To this end the evaluation was designed to be simple, to focus
on core technology issues, to be fully supported, and to be accessible. The corpus
is based entirely on conversational cellular telephone speech collected by the LDC.
Supporting documentation for this evaluation may be found on the 2001 NIST Speaker
Recognition Evaluation website. Consult the NIST evaluation plan for detailed instructions
on using this evaluation material. *Data* The files are divided into evaluation and
development data. There are a total of 2,350 compressed speech files, all of which
are in sphere format. The sphere files are compressed and encoded in one-channel 8-bit
mu-law, for a total of 575,337,198 bytes (548.7 Mbytes), or 26 hours of sphere data.
The evaluation data is divided into evaluation training data and evaluation test data.
The training data consists of 174 speech files that are two minutes long. The test
data comprises 2,038 speech files of varying lengths not exceeding sixty seconds.
The development data is similarly divided into development training data and development
test data. The training data comprises 60 speech files with durations of two minutes
per target speaker. The 78 development test data files contain segments of varying
length not exceeding 60 seconds.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S34
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632422
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S35
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 550-933-474-715-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Voicemail Corpus Part II
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S35
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Voicemail Corpus Part II was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2002S35, ISBN 1-58563-242-2. Voicemail Corpus Part II is a continuation
of Voicemail Corpus Part I, LDC98S77. *Data* This publication comprises speech
and script files, and is divided into training and evaluation data. The training
data consists of 2,048 voicemail messages and the corresponding script files. The
speech and script files are organized in 41 directories, each of which contains up
to 50 messages. The evaluation data consists of 50 voicemail messages and 50 scripts.
The speech data is provided in sphere format, sampled at 8 kHz and recorded
in 8-bit u-law, totalling approximately 14 hours (406 MB) for training and 23 minutes
(11 MB) for evaluation. In addition to the individual script files, there are three
files which represent a concatenation of the individual scripts: train_scripts.all
and eval_scripts.all represent a concatenation of the training and evaluation script
files, one file per line, each line beginning with the fileID. eval_scripts_filtered.all
is a filtered version of the file eval_scripts.all, after eliminating the tagged elements
() and the proper nouns marker.
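The concatenated script files are described as holding one script per line, with each line beginning with the fileID. A minimal sketch of reading such a file into a lookup table follows. It assumes the fileID is separated from the transcript by whitespace, which this record does not confirm, and the sample lines and fileIDs are invented for illustration.

```python
def parse_scripts(lines):
    """Map each fileID to its transcript text, one script per line."""
    scripts = {}
    for line in lines:
        parts = line.strip().split(None, 1)   # assumption: whitespace delimiter
        if not parts:
            continue                          # skip blank lines
        file_id = parts[0]
        scripts[file_id] = parts[1] if len(parts) > 1 else ""
    return scripts

# Hypothetical lines; real fileIDs and transcripts will differ.
sample = ["vm0001 hi this is a test message", "vm0002 please call me back"]
scripts = parse_scripts(sample)
print(scripts["vm0002"])  # please call me back
```

With a real file, the same function could be applied to the lines of train_scripts.all or eval_scripts.all once the actual delimiter has been confirmed.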
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Padmanabhan, Mukund
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kingsbury, Brian
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramabhadran, Bhuvana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Jing
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chen, Stanley
ADDED ENTRY--PERSONAL NAME
- Personal name:
Saon, George
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mangu, Lidia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S35
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632430
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S37
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 710-821-044-437-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Egyptian Arabic Speech Supplement
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S37
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Egyptian Arabic Speech Supplement was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2002S37 and ISBN 1-58563-243-0. This publication contains
20 CALLHOME Egyptian Arabic telephone conversations. The corresponding transcripts
are published as CALLHOME Egyptian Arabic Transcripts Supplement, LDC catalog number
LDC2002T38. These conversations had originally been held in reserve for future NIST
HUB5 Non-English evaluations, but are being "re-tasked" to provide additional material
for general use. *Data* There are 20 data files in sphere format. The files are 8
kHz shorten-compressed two-channel mu-law. Twelve of the files were recorded from domestic
phone calls (both parties living in the continental U.S.), while the other eight are
overseas calls (a participant in the U.S. called a friend or relative in Egypt or
some other overseas country). There is a total of 273,681,144 bytes (261 Mbytes) or
eight hours of audio data.
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S37
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632589
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002S56
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 679-178-608-649-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2000 Communicator Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002S56
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2000 Communicator Evaluation was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2002S56, ISBN 1-58563-258-9. The original goals of the Communicator program
were to support the creation of speech-enabled interfaces that scale gracefully across
modalities, from speech-only to interfaces that include graphics, maps, pointing and
gesture. The original vision of the Communicator systems included the ability of a
user, during one 10-minute session, to plan a three-leg trip, with the three flights/legs
on three different days, with rental car and hotel in each of the two "away" cities,
plus dictating/sending a voice-mail message. The actual research that led to the data
collections in 2000 and 2001 explored ways to construct better spoken-dialogue systems,
with which users interact via speech alone to perform relatively complex tasks
such as travel planning. During 2000 and 2001 two large data sets were collected,
in which users used the Communicator systems built by the research groups to do travel
planning. The researchers improved their systems intensively during the ten months
between the two data collections. This distribution consists of all the data from
the 2000 collection. All the Communicator implementations used a common software architecture,
called Galaxy-II, which was designed by a research team at MIT and adapted for Communicator
in collaboration with a team at MITRE. The architecture supported detailed logging
of the interaction between users and the systems. *Data* Nine sites participated
in this project: ATT, BBN, Carnegie Mellon University, IBM, MIT, MITRE, NIST, SRI
and University of Colorado at Boulder. In 2000, each user called the nine different
automated travel-planning systems to make simulated flight reservations. The order
in which the users encountered the systems was counterbalanced, for statistical analysis
purposes. All aspects of the reservations were simulated in 2000. Each user was to
make nine calls. The first seven calls had an assigned hypothetical travel task, which
the user got via the web. The last two calls asked the user to make simulated travel
reservations for a trip that they might wish to take: they were asked to make travel
plans for a vacation or pleasure trip on the eighth call and a business trip paid
for by an employer on the ninth call. All audio files are in SPHERE format, recorded
in 8-bit u-law and PCM, at 8 kHz. The files consist of the sites' recordings and the
NIST recordings. The sites' recordings are utterance level (one channel) while the
NIST recordings are a continuous recording of the whole call (both channels: user
and system). The two-channel sphere files total ~62 hours of audio (3415 MB), representing
~317K words in transcription. The caller-side files submitted by the sites have had
sample_checksums added to their headers.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Marilyn
ADDED ENTRY--PERSONAL NAME
- Personal name:
Aberdeen, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sanders, Gregory
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002S56
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632171
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 642-346-533-147-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Chinese Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Chinese Corpus was developed by the Linguistic Data Consortium
(LDC) and contains 17 human and machine translations of Chinese newswire and broadcast
data. To support the development of automatic means for evaluating translation quality,
LDC was sponsored to solicit 11 sets of human translations for a single set of Mandarin
Chinese source materials. LDC was also asked to produce translations from various
commercial machine translation (MT) systems as well as from MT systems available on
the Internet. *Data* Three sources of journalistic Mandarin Chinese text were selected
from existing LDC corpora:
* Xinhua News Service: 52 news stories
* Zaobao News Service: 27 news stories
* Voice of America Mandarin broadcast transcripts: 26 news stories
for a total of 105 stories.
The Xinhua data were drawn from Chinese Treebank 2.0 (LDC2001T11); the file names
and "doc_id" attributes assigned to these stories match the file names used in the
Chinese Treebank release. The Zaobao and VOA data were both drawn from TDT3 Multilanguage
Text Version 2.0 (LDC2001T58); their file names and "doc_id" attributes match the
"DOCNO" tags assigned to these stories in the TDT3 release. Selection of stories from
the two newswire collections was controlled by story length: all selected stories
contain between about 340 and 400 Chinese characters. The selection from VOA broadcasts
varied more widely, between 100 and 1,000 characters per story. The VOA Mandarin transcripts
in TDT3 were created manually by a professional transcription service, but with limited
editorial quality control -- while generally quite complete, these transcripts were
not expected to exceed the quality or accuracy of closed-caption text in television
broadcasts. Zaobao is a news portal from Singapore and many of its news stories are
translations from other news agencies' releases. There are a total of 1,890 data files.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632023
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 502-719-830-448-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Translanguage English Database (TED) Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Translanguage English Database (TED) Transcripts consists of transcripts of presentations
by 39 native English and non-native English speakers at the Third European Conference
on Speech Communication and Technology, EUROSPEECH 1993 in Berlin, Germany. This is
a joint publication with the European Language Resources Association (ELRA) sponsored
in part by National Science Foundation Grant No. IIS-9982201. The data set is released
by ELRA as Translanguage English Database (TED) Transcripts database (ELRA-S0120).
*Data* The transcripts in this release were developed by the Linguistic Data Consortium
and are a subset of the speech recordings in Translanguage English Database (TED)
Speech LDC2002S04 and ELRA publication ELRA-S0031. The transcripts are in Universal
Transcription Format (UTF). All UTF files were validated against a utf.dtd. Tables
containing speaker demographic information and cross-references of file names from
the TED audio corpus are included in this release.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mariani, J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schiel, F.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, N.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, D.A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, K.T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Markoff, R.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632236
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 299-735-991-930-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RST Discourse Treebank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Rhetorical Structure Theory (RST) Discourse Treebank was developed by researchers
at the Information Sciences Institute (University of Southern California), the US
Department of Defense and the Linguistic Data Consortium (LDC). It consists of 385
Wall Street Journal articles from the Penn Treebank annotated with discourse structure
in the RST framework along with human-generated extracts and abstracts associated
with the source documents. In the RST framework (Mann and Thompson, 1988), a text's
discourse structure can be represented as a tree in four aspects: (1) the leaves correspond
to text fragments called elementary discourse units (the minimal discourse units);
(2) the internal nodes of the tree correspond to contiguous text spans; (3) each node
is characterized by its nuclearity, or essential unit of information; and (4) each
node is also characterized by a rhetorical relation between two or more non-overlapping,
adjacent text spans. *Data* The data in this release is divided into a training set
(347 documents) and a test set (38 documents). All annotations were produced using
a discourse annotation tool that can be downloaded from http://www.isi.edu/~marcu/discourse.
Human-generated material in the corpus includes (1) long and short abstracts for 30
documents that were intended to convey the essential information and the main topic
of the article, respectively; and (2) long, short and informative extracts for 180
documents, some of which were created from scratch and some of which were derived
from the human-produced abstracts indicated above.
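The four aspects of an RST tree listed above can be sketched as a small data structure. This is an illustrative sketch only: the class name `RSTNode`, its field names, and the nuclearity labels are assumptions and do not reflect the file format used in the release.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSTNode:
    """One node of a discourse tree (hypothetical representation)."""
    nuclearity: str                      # assumed labels: "Root", "Nucleus", "Satellite"
    relation: Optional[str] = None       # rhetorical relation holding at this node
    text: Optional[str] = None           # set only on leaves (elementary discourse units)
    children: List["RSTNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def edus(self) -> List[str]:
        """Return the elementary discourse units under this node, left to right."""
        if self.is_leaf():
            return [self.text or ""]
        units: List[str] = []
        for child in self.children:
            units.extend(child.edus())
        return units


# Hypothetical two-unit example: a nucleus followed by a satellite.
example_tree = RSTNode(
    nuclearity="Root",
    children=[
        RSTNode(nuclearity="Nucleus", relation="span", text="Prices rose"),
        RSTNode(nuclearity="Satellite", relation="cause", text="because demand grew."),
    ],
)
```

Because internal nodes cover contiguous spans, joining the result of `edus()` in order reproduces the text covered by a node.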
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Carlson, Lynn
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Okurowski, Mary Ellen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632368
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 977-393-913-599-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean English Treebank Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Korean English Treebank Annotations, Linguistic
Data Consortium (LDC) catalog number LDC2002T26 and ISBN 1-58563-236-8. This corpus
consists of 33 texts originally written in Korean and translated into English for
the purpose of language training in a military setting. The conversations are not
authentic dialogues but were constructed for pedagogical purposes. The texts were
made available for linguistic research by the Defense Language Institute (DLI). They
were delivered on paper to the Institute for Research in Cognitive Science (IRCS)
at the University of Pennsylvania, where they were converted to digital form using
the KSC 5601 character set encoding (also known as KS X 1001 Wansung). Both the Korean
and English texts are presented with complete Treebank annotation which was done manually
at IRCS, including syntactic constituent bracketing and part-of-speech (POS) tagging.
Further documentation about the parsing and POS specifications used in these annotations
can be found on the Korean NLP web site. *Data* There are 66 data files: 33 for Korean
and 33 for English. The text files mostly contain sets of question and answer sentences.
A full, unannotated sentence is presented first, on a single line with an initial
semi-colon character ";" -- the first token on such lines (the string preceding the
first space character on the line) is a sentence-identifier tag that matches the English
and Korean versions of the sentence. The parsed/POS-tagged annotation of the sentence
follows on subsequent lines.
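The file layout described above can be read with a short routine along the following lines. This is a minimal sketch under stated assumptions: the function name, the sample identifiers, and the sample bracketing are hypothetical, and it assumes the semicolon is immediately followed by the sentence identifier.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, List[str]]  # (sentence_id, raw_sentence, annotation_lines)


def split_treebank_sentences(lines: Iterable[str]) -> List[Record]:
    """Group lines into one record per sentence.

    A line starting with ';' opens a new record; its first token (the text
    before the first space) is taken as the sentence identifier. All
    following non-blank lines are kept as that sentence's annotation.
    """
    records: List[Record] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(";"):
            first_token, _, sentence = line.partition(" ")
            records.append((first_token.lstrip(";"), sentence, []))
        elif records and line.strip():
            records[-1][2].append(line)
    return records


# Hypothetical sample; the real identifiers and tag sets may differ.
sample_lines = [
    ";S001 Where is the station?",
    "(S (WHADVP Where)",
    "   (VP is))",
    "",
    ";S002 It is there.",
    "(S (NP It))",
]
sample_records = split_treebank_sentences(sample_lines)
```

Since the identifier matches across the Korean and English versions, records from the two files could be paired by building a dictionary keyed on `sentence_id`. For the Korean files, Python's `euc_kr` codec is one plausible choice for decoding KS C 5601 text, assuming the files use that byte layout.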
LANGUAGE NOTE
- Language note:
Content in Korean and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Chung-Hye
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ko, Eon-Suk
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yi, Hee-Jong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Chris
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duda, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632406
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T31
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 153-002-267-999-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The AQUAINT Corpus of English News Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T31
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The AQUAINT Corpus, Linguistic Data Consortium (LDC) catalog number LDC2002T31 and
ISBN 1-58563-240-6, consists of newswire text data in English, drawn from three sources:
the Xinhua News Service (People's Republic of China), the New York Times News Service,
and the Associated Press Worldstream News Service. It was prepared by the LDC for
the AQUAINT Project, and will be used in official benchmark evaluations conducted
by the National Institute of Standards and Technology (NIST). *Data* The data files contain
roughly 375 million words correlating to about 3GB of data. The text data are separated
into directories by source (apw, nyt, xie); within each source, data files are subdivided
by year, and within each year, there is one file per date of collection. Each file
is named to reflect the source and date, and contains a stream of SGML-tagged text
data presenting the series of news stories reported on the given date as a concatenation
of DOC elements (i.e. blocks of text bounded by <DOC> and </DOC> tags). All data files are published
in compressed form, using the GNU "gzip" utility; as such, all files have a ".gz"
extension, and will have null file name extension when uncompressed in the usual way
(i.e. just the base file name, consisting of "YYYYMMDD_SRC"). While all the data files
are covered by a single DTD, it is not the case that they all have a single pattern
of markup. Rather, all files share a core markup structure, with minor variations
in the peripheral regions of each DOC element, and the DTD has been written to accommodate
the variations.
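The layout described above (gzip-compressed files named by date and source, each a concatenation of DOC elements) could be processed roughly as follows. This is a sketch, not the LDC's tooling: the function names are hypothetical, the DOC elements are assumed to be delimited by literal `<DOC>` and `</DOC>` tags, and the `latin-1` default encoding is an assumption.

```python
import gzip
import re
from typing import Iterator, List, Tuple

# Assumes each story is wrapped in literal <DOC> ... </DOC> tags.
DOC_PATTERN = re.compile(r"<DOC>.*?</DOC>", re.DOTALL)


def parse_file_name(name: str) -> Tuple[str, str]:
    """Split a name of the form 'YYYYMMDD_SRC' (with or without '.gz')."""
    base = name[:-3] if name.endswith(".gz") else name
    date, _, source = base.partition("_")
    return date, source


def iter_docs(text: str) -> Iterator[str]:
    """Yield each DOC element found in the decompressed text, in order."""
    for match in DOC_PATTERN.finditer(text):
        yield match.group(0)


def read_docs(path: str, encoding: str = "latin-1") -> List[str]:
    """Decompress one data file and return its DOC elements."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return list(iter_docs(handle.read()))
```

Because the peripheral markup inside each DOC element varies by source, this sketch deliberately stops at the DOC boundary; anything finer-grained would need to follow the DTD shipped with the corpus.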
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T31
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632449
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T38
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 244-440-307-221-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Egyptian Arabic Transcripts Supplement
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T38
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Egyptian Arabic Transcripts Supplement corpus was produced by Linguistic
Data Consortium (LDC), catalog number LDC2002T38 and ISBN 1-58563-244-9. This publication
contains transcripts for 20 CALLHOME Egyptian Arabic telephone conversations. These
20 conversations are published as CALLHOME Egyptian Arabic Speech Supplement LDC2002S37.
These conversations had originally been held in reserve for future NIST HUB5 Non-English
evaluations, but are being "re-tasked" to provide additional material for general
use. *Data* There are 40 data files. Each of the 20 calls has transcripts in two formats:
.txt and .scr. The .txt files are transcript files containing the Romanized orthographic
forms that were used in the original transcription process. These forms also serve
as the head-words in the associated Egyptian Colloquial Arabic Lexicon LDC99L22. The .scr
files are transcript files rendered in Arabic script orthography, using the ISO 8859-6
character set; these files were derived from the .txt files by replacing each word
token with its Arabic script counterpart (which is also provided in the Egyptian Colloquial
Arabic Lexicon). These files have been formatted to avoid problems of bi-directional
text: line-feed characters are used to separate ASCII content from Arabic script content
in each utterance.
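As an illustration of the .scr layout described above, the following sketch decodes ISO 8859-6 bytes and labels each line as ASCII or Arabic script. The function names and the sample utterance are hypothetical; only the character set and the line-feed separation come from the description.

```python
from typing import List, Tuple


def decode_scr(data: bytes) -> List[str]:
    """Decode ISO 8859-6 bytes and split on line-feed characters."""
    return data.decode("iso8859_6").split("\n")


def label_lines(lines: List[str]) -> List[Tuple[str, str]]:
    """Tag each non-empty line as 'ascii' or 'arabic'.

    A line counts as 'ascii' when every character is below code point 128.
    """
    labelled = []
    for line in lines:
        if not line:
            continue
        kind = "ascii" if all(ord(ch) < 128 for ch in line) else "arabic"
        labelled.append((kind, line))
    return labelled


# Hypothetical utterance: an ASCII line, then a line in Arabic script.
sample_bytes = b"A: 12.50 14.00\n" + "\u0633\u0644\u0627\u0645".encode("iso8859_6") + b"\n"
sample_labelled = label_lines(decode_scr(sample_bytes))
```

Keeping the two kinds of content on separate lines means each line has a single text direction, which is what lets a reader like this one avoid bi-directional rendering issues.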
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T38
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632457
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T39
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 781-317-110-120-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 Arabic Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T39
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 Arabic Transcripts corpus was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2002T39 and ISBN 1-58563-245-7. This publication contains
transcripts for twenty CALLHOME Egyptian Arabic telephone conversations. These 20
conversations were used in NIST's 1997 HUB5 Non-English evaluation, and are published
as 1997 HUB5 Arabic Evaluation LDC2002S22. *Data* There are 40 data files. Each of
the 20 calls has transcripts in two formats: .txt and .scr. The .txt files are transcript
files containing the Romanized orthographic forms that were used in the original transcription
process. These forms also serve as the head-words in the associated Egyptian Colloquial
Arabic Lexicon LDC99L22. The .scr files are transcript files rendered in Arabic script
orthography, using the ISO 8859-6 character set; these files were derived from the
.txt files by replacing each word token with its Arabic script counterpart (which
is also provided in the CALLHOME Arabic Lexicon). These files have been formatted
to avoid problems of bi-directional text: line-feed characters are used to separate
ASCII content from Arabic script content in each utterance.
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T39
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u bai d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632554
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 880-081-036-797-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
bai
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ybb
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Grassfields Bantu Fieldwork: Dschang Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Grassfields Bantu Fieldwork: Dschang Lexicon was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003L01 and ISBN 1-58563-255-4. The data contains a lexicon
of the language Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language
spoken by 300,000+ people in Southwestern Cameroon. *Data* There are two sets of pages
in HTML format: one set is 28 pages, the other is 105. In the first set, one page
lists every word in the lexicon that begins with a given letter (or letter combination).
In the second set, one page lists approximately 25 entries, unless there are fewer
than 25 entries beginning with a given letter. The second set is more convenient for
use on slower machines. Each entry under a headword has links to two or more sound
files. Two speakers were recorded; a given entry generally contains links to utterances
of the headword by both speakers, as well as a laryngograph recording of the headword
by the second speaker. Nouns may also have links to pronunciations of the plurals,
and verbs may have links to pronunciations of the imperative and infinitive forms.
These ancillary forms will in general also have laryngograph recordings. Recorded:
May 1997, Dschang, recording studio of SIL Cameroon, Yaoundé. Digitized, Labelled and
Segmented: 1997-1998, Phonetics Laboratory, University of Edinburgh. Annotated: 1998-2002,
LDC, University of Pennsylvania. Sponsorship: SIL Cameroon; Economic and Social Research
Council (UK) Grant R000235540; National Science Foundation (US) Grant 9983258; National
Science Foundation (US) TalkBank Project Grant BCS-998009, KDI, SBE; Linguistic Data
Consortium.
LANGUAGE NOTE
- Language note:
Content in Yemba, English, and French. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Yemba language
- Form subdivision:
Dictionaries
- General subdivision:
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Yemba language
- Form subdivision:
Dictionaries.
- General subdivision:
Pronunciation
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bird, Steven
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632651
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003L02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 261-728-030-958-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Telephone Conversations Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003L02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Telephone Conversations Lexicon was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003L02 and ISBN 1-58563-265-1. Korean Telephone Conversations
Lexicon consists of 25,251 words, and contains separate fields with phonological,
morphological, and frequency information for each word. The lexicon covers the tokens
occurring in 100 telephone conversations transcribed and published as Korean Telephone
Conversations Transcripts. The token coverage is 100%. The corresponding speech is
published as Korean Telephone Conversations Speech. *Data* The lexicon contains five
tab-separated information fields: * orthographic form in Hangul (head-word), encoded
in the KSC-5601 (Wansung) system * orthographic form in Yale romanization * pronunciation
* frequency of the word in Korean Telephone Conversations Transcripts * morphological
analysis of the word
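A record with the five tab-separated fields listed above could be parsed as in the sketch below. The names `LexiconEntry`, `parse_lexicon_line`, and `parse_lexicon_bytes` are hypothetical, the sample entry is invented for illustration, and two points are assumptions: that the frequency field is an integer count, and that Python's `euc_kr` codec matches the KSC-5601 (Wansung) encoding of the file.

```python
from typing import List, NamedTuple


class LexiconEntry(NamedTuple):
    hangul: str          # orthographic form in Hangul (head-word)
    yale: str            # orthographic form in Yale romanization
    pronunciation: str
    frequency: int       # assumed to be an integer count
    morphology: str      # morphological analysis, kept as an opaque string


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    hangul, yale, pronunciation, frequency, morphology = fields
    return LexiconEntry(hangul, yale, pronunciation, int(frequency), morphology)


def parse_lexicon_bytes(data: bytes, encoding: str = "euc_kr") -> List[LexiconEntry]:
    """Decode raw bytes and parse every non-blank line."""
    text = data.decode(encoding)
    return [parse_lexicon_line(line) for line in text.splitlines() if line.strip()]


# Invented entry; the real pronunciation and morphology notations may differ.
sample_data = "\ud559\uad50\thakkyo\thak.kyo\t12\t\ud559\uad50/N\n".encode("euc_kr")
sample_entries = parse_lexicon_bytes(sample_data)
```

Raising on a wrong field count, instead of padding or truncating, makes layout surprises visible early when the assumptions above turn out not to hold.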
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Spoken Korean
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kim, Myeonchul
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003L02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632597
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 042-183-636-648-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 Communicator Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2001 Communicator Evaluation was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2003S01, ISBN 1-58563-259-7. The original goals of the Communicator program
were to support the creation of speech-enabled interfaces that scale gracefully across
modalities, from speech-only to interfaces that include graphics, maps, pointing and
gesture. The original vision of the Communicator systems included the ability of a
user, during one 10-minute session, to plan a three-leg trip, with the three flights/legs
on three different days, with rental car and hotel in each of the two "away" cities,
plus dictating/sending a voice-mail message. The actual research that led to the data
collections in 2000 and 2001 explored ways to construct better spoken-dialogue systems,
with which users interact via speech alone to perform relatively complex tasks such
as travel planning. During 2000 and 2001 two large data sets were collected, in which
users used the Communicator systems built by the research groups to do travel planning.
The researchers improved their systems intensively during the ten months between the
two data collections. This distribution consists of all the data from the 2001 collection.
All the Communicator implementations used a common software architecture, called Galaxy-II,
which was designed by a research team at MIT and adapted for Communicator in collaboration
with a team at MITRE. The architecture supported detailed logging of the interaction
between users and the systems. For possible updated information about the Communicator
project and the data distributions, please visit the NIST website. *Data* The following
sites participated in this project: ATT, BBN, Carnegie Mellon University, IBM, Lucent
Bell Labs, MIT, SRI and University of Colorado at Boulder. All audio files have been
converted into SPHERE format; there are 53,394 SPHERE files, totalling approximately
102 hours of audio. All SPHERE files are one-channel, 8 kHz, but the sample coding
and format, while consistent for all files belonging to one site, are not consistent
across sites (for example, some sites provided pcm, while others provided u-law data).
The documentation included in this distribution is replicated exactly as received
from NIST and from the participating sites.
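Because the sample coding differs across sites, a reader may want to inspect each file's header before decoding audio. The following Python sketch parses the ASCII header of a SPHERE file. It assumes the usual NIST SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" lines ending with "end_head") and field names such as sample_coding and sample_rate; these are assumptions to verify against the documentation shipped with the distribution. The function name parse_sphere_header is hypothetical.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header found at the start of a SPHERE file.

    Assumed layout: 'NIST_1A\\n', a padded header-size line, then
    'name -type value' lines terminated by 'end_head'.
    """
    if len(data) < 16:
        raise ValueError("too short to contain a SPHERE header")
    magic, size_line = data[:16].decode("ascii", errors="replace").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    header_size = int(size_line.strip())

    fields = {}
    body = data[16:header_size].decode("ascii", errors="replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3 or line.startswith(";"):
            continue  # skip blank, comment, or malformed lines
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # '-sN' string fields
    return fields
```

Grouping files by the sample_coding value returned here is one way to decide, per site, whether the samples need u-law expansion before further processing.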
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Marilyn
ADDED ENTRY--PERSONAL NAME
- Personal name:
Aberdeen, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sanders, Gregory
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u bai d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632546
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 973-117-906-652-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
bai
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ybb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Grassfields Bantu Fieldwork: Dschang Tone Paradigms
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Grassfields Bantu Fieldwork: Dschang Tone Paradigms was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2003S02, ISBN 1-58563-254-6. The data contains
tone paradigms of the language Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu)
language spoken by 300,000+ people in Southwestern Cameroon. *Data* There are 45 paradigm
pages in html format. Each page lists 32 utterances, varying across subject, verb,
and object. Each utterance has one to three links to recordings in .wav format, as
well as a laryngograph recording (also in .wav format). Phonetic transcription has
been done for every utterance; tonological transcription has been done for a little
more than half. Recorded: June 1997, Dschang, Western Province, and recording studio
of SIL Cameroon, Yaoundé. Digitized, Labelled and Segmented: 1997-1998, Phonetics Laboratory,
University of Edinburgh. Transcribed and Annotated: 1998-2002, LDC, University of Pennsylvania.
Sponsorship: SIL Cameroon; Economic and Social Research Council (UK) Grant R000235540;
National Science Foundation (US) Grant 9983258; National Science Foundation (US) TalkBank
Project Grant BCS-998009, KDI, SBE; Linguistic Data Consortium.
LANGUAGE NOTE
- Language note:
Content in Yemba. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Yemba language
- General subdivision:
Tone.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bird, Steven
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632635
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 977-452-139-220-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Telephone Conversations Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Telephone Conversations Speech was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2003S03, ISBN 1-58563-263-5. The telephone conversations in this
corpus were originally recorded as part of the CALLFRIEND project. The CALLFRIEND
Korean telephone speech was collected by Linguistic Data Consortium primarily in support
of the Language Identification (LID) project, sponsored by the U.S. Department of
Defense. The calls were later transcribed for use in other projects. This publication
consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND
Korean, while the remaining 51 are previously unexposed calls. All 100 conversations
have been transcribed and are published as Korean Telephone Conversations Transcripts.
The recorded conversations are between native speakers of Korean and last up to 30
minutes, of which the transcribed speech covers between 15 and 18 minutes. All speakers
were aware that they were being recorded. They were given no guidelines concerning
what they should talk about. Once a caller was recruited to participate, he/she was
given a free choice of whom to call. Most participants called family members or close
friends. All calls originated in either the United States or Canada. *Data* There
are 100 speech files, totalling approximately 44 hours of audio. All speech files
are in SPHERE format (shorten-compressed), recorded in two-channel u-law with a sampling
rate of 8 kHz.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Spoken Korean
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ko, Eon-Suk
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caravan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u rus d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632775
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 741-782-638-900-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Russian Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
West Point Russian Speech was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2003S05, ISBN 1-58563-277-5. The West Point Russian Speech corpus was
developed at the Department of Foreign Languages (DFL) and the Center for Technology
Enhanced Language Learning (CTELL) at the United States Military Academy at West Point.
The purpose of the corpus is to provide a set of recordings for the training and development
of speaker-independent speech recognition systems for use by West Point cadets enrolled
in the Russian language program. *Data* The corpus consists of 4,181 speech files
in SPHERE format, totalling approximately four hours of speech. Approximately 2,290
files are from native informants and 1,891 are from non-native informants. The following
tables show the breakdown of corpus content in terms of male, female, native and non-native
speakers.

Number of speakers:
            male  female  total
native        13      16     29
non-native    16      10     26
totals        29      26     55

Number of speech files:
            male  female  total
native      1027    1263   2290
non-native  1103     788   1891
totals      2130    2050   4181

The speech data was collected using laptop computers
running Windows NT. Recordings were captured as 16-bit pcm at a sampling rate of 22,050
Hz using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. A visual
display of the sentence, along with a digital recording of the sentence as read by
a native speaker, was presented. The informant pressed the Enter key to record the
utterance. The informant's recording was played back for review and the utterance
was re-recorded if necessary. The collection script consists of 96 sentences with
a total of 528 tokens and 351 types. Each waveform file has monophone-level and word-level
master label file transcriptions in HTK format. A concatenated version of the master
label files at both the word level and the phone level is provided. The lexicon contains
690 distinct orthographic word forms, including all words found in the collection
script.
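For readers unfamiliar with HTK master label files, the sketch below shows one way such a file could be read in Python. It assumes the conventional MLF layout (a "#!MLF!#" header, a quoted file pattern per utterance, label lines with optional start and end times in 100 ns units, and a lone "." closing each entry). Whether the files in this corpus carry time stamps is not stated in this record, so the parser accepts both forms. The function name parse_mlf is hypothetical.

```python
HTK_TIME_UNITS_PER_SECOND = 10_000_000  # HTK label times are in 100 ns units


def parse_mlf(text: str) -> dict:
    """Map each quoted file pattern in an MLF to a list of
    (start_seconds, end_seconds, label) tuples; times are None if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")

    entries = {}
    current = None
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        if current is None:
            # First line of an entry names the label file pattern.
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            current = None  # end of this entry
        else:
            parts = line.split()
            if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
                start = int(parts[0]) / HTK_TIME_UNITS_PER_SECOND
                end = int(parts[1]) / HTK_TIME_UNITS_PER_SECOND
                entries[current].append((start, end, parts[2]))
            else:
                entries[current].append((None, None, parts[0]))
    return entries
```

The same function can be applied to the concatenated word-level and phone-level files, since each is simply a longer sequence of entries under the same assumed layout.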
LANGUAGE NOTE
- Language note:
Content in Russian. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Russian language
- Form subdivision:
Databases.
- General subdivision:
Spoken Russian
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
LaRocca, Stephen A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tomei, Christine
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632724
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 951-825-759-886-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Santa Barbara Corpus of Spoken American English Part II
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Santa Barbara Corpus of Spoken American English Part II was produced by the Linguistic
Data Consortium (LDC), catalog number LDC2003S06, ISBN 1-58563-272-4. Santa Barbara
Corpus of Spoken American English Part II is based on hundreds of recordings of natural
speech from all over the United States, representing a wide variety of people of different
regional origins, ages, occupations, and ethnic and social backgrounds. It reflects
many ways that people use language in their lives: conversation, gossip, arguments,
on-the-job talk, card games, city council meetings, sales pitches, classroom lectures,
political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected
by: University of California, Santa Barbara Center for the Study of Discourse (Director:
John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer
(UMass, Boston), and Sandra A. Thompson (UCSB)). Santa Barbara Corpus of Spoken American
English Part II is also part of the International Corpus of English (ICE) (Charles
W. Meyer, Director), representing the American Component. For software and additional
data resources, please refer to the following sites: TalkBank, International Corpus
of English. Part I of the Santa Barbara Corpus of Spoken American English is also
available as LDC2000S85. *Data* The audio data consists of 16 wave format speech files,
recorded in two-channel pcm, at 22,050 Hz. The speech files total approximately six hours
of audio (1.8 GB), representing over 47K words and over 5K unique words in transcription. Each
speech file is accompanied by two transcripts in which intonation units are time stamped
with respect to the audio recording. The two types of transcripts are defined by the
file extension: .trn and .ca. The text and coding content of corresponding transcripts
is identical. However, the transcripts with the ".ca" extension are transcripts in
the CHAT format for conversational analysis, formatted for use with the CLAN software,
available from TalkBank. The transcripts with ".trn" extension are structured according
to the LDC Callhome format, for use with a variety of annotation tools. (Please also
note that transcript coding is not presented as in the ICE standard). Personal names,
place names, phone numbers, etc., in the transcripts have been altered to preserve
the anonymity of the speakers and their acquaintances and the audio files have been
filtered to make these portions of the recordings unrecognizable. Pitch information
is still recoverable from these filtered portions of the recordings, but the amplitude
levels in these regions have been reduced relative to the original signal. A separate
filter list file (*.flt) associated with each transcript/waveform file pair is provided
to list the beginning and ending times of the filtered regions. There are 4 .flt files
which are empty because there was no information that needed to be filtered out from
the audio files. The filtering was done using a digital FIR low-pass filter, with
the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded
in and out at the beginning and end of the regions over a 1,000 sample region, roughly
45 milliseconds, to avoid abrupt transitions in the resulting waveform. *Acknowledgements*
The completion and release of this corpus was facilitated by funding extended by the
TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year
grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon
University and the University of Pennsylvania.
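The fade length quoted in the summary above can be checked with simple arithmetic: 1,000 samples at 22,050 Hz is about 45 ms. The Python sketch below computes that figure and illustrates a fade-in as a blend between the original and the filtered signal. The linear ramp is an assumption made for illustration; this record does not state the ramp shape actually used, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 22_050   # sampling rate given in the summary
FADE_SAMPLES = 1_000      # fade region length given in the summary


def fade_duration_ms(n_samples: int = FADE_SAMPLES,
                     sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a fade region in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


def crossfade_in(original, filtered, n_samples: int = FADE_SAMPLES):
    """Blend from the original signal toward the filtered signal.

    Assumption: a linear ramp; the weight of the filtered signal
    grows from 0 toward 1 across the fade region.
    """
    n = min(n_samples, len(original), len(filtered))
    blended = []
    for i in range(n):
        weight = i / n
        blended.append((1.0 - weight) * original[i] + weight * filtered[i])
    return blended
```

Calling fade_duration_ms() with the defaults gives roughly 45.35 ms, consistent with the "roughly 45 milliseconds" stated in the summary.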
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Du Bois, John W.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chafe, Wallace L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyer, Charles
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thompson, Sandra A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 799-085-183-531-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 HUB5 Mandarin Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2001 HUB5 Mandarin Transcripts was developed by the Linguistic Data Consortium (LDC).
This publication contains transcripts for twenty CALLHOME Mandarin telephone conversations.
These twenty conversations were used in NIST's 2001 HUB5 Non-English evaluation, and
are published as 2001 HUB5 Mandarin Evaluation (LDC2002S12). *Data* There are 20 data
files in .txt format. The .txt files are transcript files rendered in Mandarin script
orthography, containing the orthographic forms that were used in the original transcription
process. These forms also serve as the head-words in the associated CALLHOME Mandarin
Lexicon (LDC96L15). Please follow these links for a sample transcript: Mandarin script
| GIF format.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632538
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 381-881-716-017-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1998 HUB5 English Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
1998 HUB5 English Transcripts was developed by the Linguistic Data Consortium and
consists of transcripts of 40 English telephone conversations used in the 1998 HUB5
evaluation sponsored by NIST (National Institute of Standards and Technology). The
Hub5 evaluation series focused on conversational speech over the telephone with the
particular task of transcribing conversational speech into text. Its goals were to
explore promising new areas in the recognition of conversational speech, to develop
advanced technology incorporating those ideas and to measure the performance of new
technology. Further information about the evaluation can be found on the NIST HUB5
website. *Data* This release contains transcripts in .txt format for the 40 source
speech data files used in the evaluation: (1) 20 telephone conversations from Switchboard-2
Phase 1 (LDC98S75) in which recruited speakers were connected through a robot operator
to carry on casual conversations about a daily topic announced by the robot operator
at the start of the call; and (2) 20 telephone conversations from CALLHOME American
English Speech, which consists of unscripted telephone conversations between native
English speakers. The corresponding speech data is released as 1998 HUB5 English Evaluation
(LDC2002S10). *Sample* Please follow this link for a sample transcript.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632473
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 738-976-749-370-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 German Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 German Transcripts corpus was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003T03 and ISBN 1-58563-247-3. This publication contains
transcripts for 20 CALLHOME German telephone conversations. These 20 conversations
were used in NIST's 1997 HUB5 Non-English evaluation, and are published as 1997 HUB5
German Evaluation (LDC2002S24). *Data* There are 20 data files in .txt format. The
.txt files are transcript files containing the orthographic forms that were used in
the original transcription process. These forms also serve as the head-words in the
associated CALLHOME German Lexicon (LDC97L18). Please follow this link for a sample
transcript.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
German language
- Form subdivision:
Databases.
- General subdivision:
Spoken German
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632481
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 253-136-069-048-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 HUB5 Spanish Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 HUB5 Spanish Transcripts corpus was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003T04 and ISBN 1-58563-248-1. This publication contains
transcripts for 20 CALLHOME Spanish telephone conversations. These 20 conversations
were used in NIST's 1997 HUB5 Non-English evaluation, and are published as 1997 HUB5
Spanish Evaluation (LDC2002S25). *Data* There are 20 data files in .txt format. The
.txt files are transcript files containing the orthographic forms that were used in
the original transcription process. These forms also serve as the head-words in the
associated CALLHOME Spanish Lexicon (LDC96L16). Please follow this link for a sample
transcript.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Spoken Spanish
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632600
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 953-543-425-922-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Gigaword
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Gigaword was produced by Linguistic Data Consortium (LDC) catalog number LDC2003T05
and ISBN 1-58563-260-0, and is distributed on DVD. This is a comprehensive archive
of newswire text data in English that has been acquired over several years by the
LDC. Four distinct international sources of English newswire are represented here:
* Agence France Presse English Service (afe)
* Associated Press Worldstream English Service (apw)
* The New York Times Newswire Service (nyt)
* The Xinhua News Agency English Service (xie)
*Data* Much of the content in this collection has been published previously
by the LDC in a variety of other, older corpora, particularly the North American News
text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus
(LDC2002T31). But there is a significant amount of material that is being released
here for the first time: all of the Agence France Presse content, the 1995 and 2001
Xinhua content, and the portions of NYT and APW dating from February 2001 forward.
Each data file name consists of the three-letter prefix, followed by a six-digit date
(representing the year and month during which the file contents were delivered by
the respective news source), followed by a ".gz" file extension, indicating that the
file contents have been compressed using the GNU "gzip" compression utility (RFC 1952).
So, each file contains all the usable data received by LDC for the given month from
the given news source. All text data are presented in SGML form, using a very simple,
minimal markup structure; all text consists of printable ASCII and whitespace. The
corpus has been fully validated by a standard SGML parser utility (nsgmls), using
a DTD file which is provided as part of this publication. Please follow this link
for a sample file. The markup structure, common to all data files, can be summarized
as follows:
* The headline element is optional; not all DOCs have one.
* The dateline element is optional; not all DOCs have one.
* Paragraph tags are used only if the "type" attribute of the DOC is "story".
Note that all data files use the UNIX-standard "\n" form of line termination, and text
lines are generally wrapped to a width of 80 characters or less. For this release, all
sources have received a uniform treatment in terms of quality control, and we have
applied a rudimentary (and _approximate_) categorization of DOC units into four distinct
"types." The classification is indicated by the type="string" attribute that is included
in each opening DOC tag. The four types are: story, multi, advis and other. Statistics
regarding the quantities of data
for each source are summarized below. Note that the "Totl-MB" numbers show the amount
of data you get when the files are not compressed (i.e. nearly 12 gigabytes, total);
the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM;
the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all
types) after all SGML tags are eliminated.
Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFE         44      417     1216   170969   656269
APW         91     1213     3647   539665  1477466
NYT         96     2104     5906   914159  1298498
XIE         83      320      940   131711   679007
TOTAL      314     4054    11709  1756504  4111240
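The file-naming convention described in the summary (three-letter source prefix, six-digit date, ".gz" extension) can be illustrated with a minimal Python sketch. This is not part of the LDC distribution. The function and class names are hypothetical, the sample file name is made up for illustration, and the sketch assumes the six digits are ordered as four-digit year followed by two-digit month.

```python
import re
from typing import NamedTuple

# Source prefixes as listed in the summary above.
SOURCES = {
    "afe": "Agence France Presse English Service",
    "apw": "Associated Press Worldstream English Service",
    "nyt": "The New York Times Newswire Service",
    "xie": "The Xinhua News Agency English Service",
}


class GigawordFile(NamedTuple):
    """Hypothetical container for the parts of a data file name."""
    source: str
    year: int
    month: int


# Assumption: the six-digit date is YYYYMM.
_NAME_RE = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


def parse_gigaword_name(name: str) -> GigawordFile:
    """Split a data file name into source prefix, year and month."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised data file name: {name!r}")
    prefix = match.group(1)
    year = int(match.group(2))
    month = int(match.group(3))
    if prefix not in SOURCES:
        raise ValueError(f"unknown source prefix: {prefix!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return GigawordFile(prefix, year, month)


if __name__ == "__main__":
    # Illustrative file name only; not taken from the corpus listing.
    print(parse_gigaword_name("nyt200102.gz"))
```

Because each file holds one month of data from one source, a parsed name is enough to group or filter files by source and period before decompressing anything.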
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632619
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 333-321-196-670-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 1 v 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 1 v 2.0 was produced by Linguistic Data Consortium (LDC) catalog
number LDC2003T06 and ISBN 1-58563-261-9. This publication is part one of a corpus
of one million words of Arabic Treebank, designed to support language research and
development of language technology for Modern Standard Arabic. *Data* The Penn Arabic
Treebank, which is part of the DARPA TIDES project, started in the Fall of 2001 with
the objective of performing human and computer annotations of a large Arabic machine-readable
text corpus (for project background please see POStest.html). As in previous Penn
Treebanks, two different kinds of information need to be produced by two different
(human and computer) processes. The Arabic Treebank project consists therefore of
two distinct phases:
* Part-of-Speech (POS) tagging: divides the text into lexical tokens, and gives relevant
information about each token such as lexical category, inflectional features, and a gloss
* Arabic Treebanking (ArabicTB): characterizes the constituent structures of word
sequences, provides categories for each non-terminal node, and identifies null elements,
co-reference, traces, etc.
Both tasks started
in November 2001 with an initial pilot consisting of 734 files representing roughly
166K words of written Modern Standard Arabic newswire from the Agence France Presse
corpus. The target of this publication is to provide a description of a written Modern
Standard Arabic text corpus. The source data consists of Agence France Presse (AFP)
newswire, spanning from July through November of 2000. This publication includes 734
stories representing 140,265 words (168,123 tokens after clitic segmentation in the
Treebank).
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632627
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 908-654-225-543-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 1 - 10K-word English Translation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 1 - 10K-word English Translation was produced by Linguistic
Data Consortium (LDC) catalog number LDC2003T07 and ISBN 1-58563-262-7. The purpose
of this corpus of 10K Arabic words translated into English is to support the development
of data-driven approaches to natural language processing, machine translation, human
language technologies, cross-lingual information retrieval, and other forms of linguistic
research on Modern Standard Arabic in general. *Data* The project targets the translation
of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire
archives for July 2000 (the files are dated 20000715*). The corpus consists of 49
source stories, which is a subset of the 734 stories published in Arabic Treebank:
Part 1 v 2.0. These 49 source files consist of 418 paragraphs and 9,981 words. The
source data and the translations are stored in SGML format. The files have been validated
using the DTD provided in the corpus. Please follow these links for an example of
an Arabic source file and the English translation. The stories have been translated
at paragraph level and verified/corrected by different annotators. In general, the
translation between Arabic and English has been aligned at sentence-to-sentence level.
However, we noticed that an Arabic sentence can be translated into multiple English
sentences (16 occurrences), and two Arabic sentences can be translated into a single
English sentence (two occurrences). For 18 paragraphs out of the total of 418 in the
corpus, only paragraph-to-paragraph alignment is provided.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632643
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 248-953-409-804-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Telephone Conversations Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Telephone Conversations Transcripts was produced by Linguistic Data Consortium
(LDC) catalog number LDC2003T08 and ISBN 1-58563-264-3. The telephone conversations
on which these transcripts are based were originally recorded as part of the CALLFRIEND
project. The CALLFRIEND Korean telephone speech was collected by Linguistic Data Consortium
primarily in support of the Language Identification (LID) project, sponsored by the
U.S. Department of Defense. The calls were later transcribed for use in other projects.
This publication consists of 100 transcribed telephone conversations in Korean. The
corresponding speech is published as Korean Telephone Conversations Speech. The Korean
orthographic forms from the 100 transcription files serve as the head-words in the
associated Korean Telephone Conversations Lexicon. The recorded conversations are
between native speakers of Korean and last up to 30 minutes, of which the transcribed
speech covers between 15 and 18 minutes. All speakers were aware that they were being
recorded. They were given no guidelines concerning what they should talk about. Once
a caller was recruited to participate, he/she was given a free choice of whom to call.
Most participants called family members or close friends. All calls originated in
either the United States or Canada. *Data* There are 100 time-aligned text files,
totalling approximately 190K words and 25K unique words. All files are in Korean orthography:
orthographic Korean characters are in Hangul, encoded in KSC5601 (Wansung) system,
also known as EUC-KR or ISO-2022-KR. Please follow this link for a sample transcript:
txt | gif.
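Since the transcripts are described as Hangul text in the KSC5601 (Wansung) encoding, a reader working in a Unicode environment would need to decode them first. The following Python sketch shows one way to do that. It is not part of the corpus documentation. The function names are hypothetical, and it assumes the files decode cleanly with Python's built-in "euc_kr" codec.

```python
from pathlib import Path


def decode_transcript(raw: bytes) -> str:
    """Decode transcript bytes to a Unicode string.

    Assumption: the bytes are EUC-KR (KS C 5601 Wansung), as the
    summary states; undecodable bytes raise UnicodeDecodeError.
    """
    return raw.decode("euc_kr")


def load_transcript(path: str) -> str:
    """Read a transcript file from disk and decode it."""
    return decode_transcript(Path(path).read_bytes())


if __name__ == "__main__":
    # Round-trip a short Hangul string to show the decoding step.
    sample = "한국어".encode("euc_kr")
    print(decode_transcript(sample))
```

Decoding strictly, rather than with `errors="replace"`, makes any file that is not in the expected encoding fail loudly instead of silently producing corrupted text.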
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Spoken Korean
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ko, Eon-Suk
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632309
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 251-875-847-656-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Gigaword
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Gigaword was produced by Linguistic Data Consortium (LDC) catalog number LDC2003T09
and ISBN 1-58563-230-9. This is a comprehensive archive of newswire text data that
has been acquired from Chinese news sources by the LDC over several years. Two distinct
international sources of Chinese newswire are represented here:
* Central News Agency of Taiwan (cna)
* Xinhua News Agency of Beijing (xin)
Some of the Xinhua content in
this collection has been published previously by the LDC in other, older corpora,
particularly Mandarin Chinese News Text (LDC95T13), TREC Mandarin (LDC2000T52), and
the various TDT Multilanguage Text corpora. But all of the CNA data and a significant
amount of Xinhua material is being released here for the first time. *Data* There
are 286 files, totalling approximately 1.5GB in compressed form. The table below presents
the following categories of information: source of the data, number of files per source,
Gzip-MB shows totals for compressed file sizes, Totl-MB shows totals for uncompressed
file sizes (nearly four gigabytes, total), K-wrds are actually the number of Chinese
characters (there is no notion of "space-separated word tokens" in Chinese), and number
of documents.
Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
CNA        144     1018     2606   735499  1649492
XIE        142      548     1331   382881   817348
TOTAL      286     1566     3937  1118380  2466840
The original
data archives received by the LDC from Xinhua were encoded in GB-2312, whereas those
from CNA were encoded in Big-5. To avoid the problems and confusion that could result
from differences in character-set specifications, all text files in this corpus have
been converted to UTF-8 character encoding. With some exceptions described in the
0readme.txt file, all characters in the text are either single-byte ASCII or multi-byte
Chinese. Each data file name consists of a three-letter prefix, followed by a six-digit
date (representing the year and month during which the file contents were generated
by the respective news source), followed by a ".gz" file extension, indicating that
the file contents have been compressed using the GNU "gzip" compression utility (RFC
1952). So, each file contains all the usable data received by LDC for the given month
from the given news source. All text data are presented in SGML form, using a very
simple, minimal markup structure. The corpus has been fully validated by a standard
SGML parser utility (nsgmls), using a DTD file provided in the corpus. Unlike older
corpora, the present corpus uses only the information structure that is common to
all sources and serves a clear function: headline, dateline, and core news content
(usually containing paragraphs). All sources have received a uniform treatment in
terms of quality control and have been categorized into four distinct "types":
* story: this type of DOC represents a coherent report on a particular topic or event,
consisting of paragraphs and full sentences
* multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly
describes a particular topic or event: "summaries of today's news," "news briefs in ..."
(some general area like finance or sports), and so on
* advis: these are DOCs which the news service addresses to news editors; they are not
intended for publication to the "end users"
* other: these DOCs clearly do not fall into any of the above types; these are things
like lists of sports scores, stock prices, temperatures around the world, and so on
The general strategy for categorizing
DOCs into these four classes was, for each source, to discover the most common and
frequent clues in the text stream that correlated with the three "non-story" types.
When none of the known clues was in evidence, the DOC was classified as a "story."
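The encoding conversion described in the summary (GB-2312 for Xinhua, Big-5 for CNA, both converted to UTF-8) can be sketched in Python as follows. This is an illustration of the general technique only, not LDC's actual conversion procedure. The function name and the prefix-to-codec mapping are assumptions, and the published corpus files are already in UTF-8, so no such step is needed to read them.

```python
# Assumed mapping from source prefix to the original archive encoding,
# following the summary: Xinhua used GB-2312 and CNA used Big-5.
SOURCE_ENCODINGS = {
    "xin": "gb2312",
    "cna": "big5",
}


def to_utf8(raw: bytes, source: str) -> bytes:
    """Re-encode text from a source's original encoding to UTF-8.

    Raises KeyError for an unknown source and UnicodeDecodeError
    if the bytes are not valid in the expected encoding.
    """
    text = raw.decode(SOURCE_ENCODINGS[source])
    return text.encode("utf-8")


if __name__ == "__main__":
    # The same two characters, encoded differently per source,
    # come out as identical UTF-8 bytes after conversion.
    gb_bytes = "中文".encode("gb2312")
    big5_bytes = "中文".encode("big5")
    print(to_utf8(gb_bytes, "xin") == to_utf8(big5_bytes, "cna"))
```

Converting both sources to a single encoding means downstream tools can treat every file the same way, which is the motivation the summary gives for the change.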
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632686
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 567-782-098-693-0
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SAID (A Syntactically Annotated Idiom Dataset) was produced by Linguistic Data Consortium
(LDC) catalog number LDC2003T10 and ISBN 1-58563-268-6. The purpose of this corpus
is to provide data for investigating the structural configurations in which English
idioms are typically found. The assumption was that, since idioms are phrasal lexical
items (PLIs), they would therefore have structural properties which are idiosyncratic.
In order to study the structural properties of phrasal lexical items, the data is
more useful if it is syntactically annotated. *Data* The data was originally drawn
from four dictionaries of English idioms. Only citation forms, suitably adapted for
this purpose, were used. The citation files were amalgamated. The rationale for the
selection was that these are among the biggest and most comprehensive lists of English
idioms. There are 13,467 phrasal lexical items in this corpus. The analysis of the
phrasal lexical items was manual, while the bracketing symmetry was checked computationally.
In order to facilitate machine manipulation of the annotated data, the manual analysis
was converted to PROLOG format. This involved expansions of those PLIs which had optional
constituents so that both the case with and the case without the options were made
available. The files are provided in text format, with each record separated by a
carriage return. *Sponsorship* The New Zealand Vice Chancellors' Committee and the University
of Canterbury.
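The two computational steps mentioned in the summary above (checking bracketing symmetry and expanding optional constituents) can be sketched as follows. This is only an illustration under stated assumptions: the record does not describe the actual SAID notation, so the use of square brackets for constituents, parentheses for optional parts, and the sample phrase are all hypothetical.

```python
import re
from itertools import product


def brackets_balanced(annotation: str) -> bool:
    """Return True if every '[' is closed by a later ']' (hypothetical notation)."""
    depth = 0
    for ch in annotation:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a closing bracket appeared before its opener
                return False
    return depth == 0


def expand_optionals(phrase: str) -> list[str]:
    """Expand non-nested '(optional)' parts into every with/without variant."""
    # With one capture group, re.split puts fixed text at even indexes
    # and the optional text at odd indexes.
    parts = re.split(r"\(([^()]*)\)", phrase)
    choices = [[p] if i % 2 == 0 else [p, ""] for i, p in enumerate(parts)]
    variants = []
    for combo in product(*choices):
        # Collapse the double spaces left behind when an option is dropped.
        variants.append(" ".join("".join(combo).split()))
    return variants


if __name__ == "__main__":
    print(brackets_balanced("[VP kick [NP the bucket]]"))
    print(expand_optionals("kick the (old) bucket"))
```

A phrase with n optional parts yields 2^n variants, which matches the idea of providing both the case with and the case without each option.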
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Idioms.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Dictionaries.
- General subdivision:
Idioms
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kuiper, Koenraad
ADDED ENTRY--PERSONAL NAME
- Personal name:
McCann, Heather
ADDED ENTRY--PERSONAL NAME
- Personal name:
Quinn, Heidi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Aitchison, Therese
ADDED ENTRY--PERSONAL NAME
- Personal name:
van der Veer, Kees
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632708
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 498-363-793-174-9
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE-2 Version 1.0 was produced by Linguistic Data Consortium (LDC) catalog number
LDC2003T11 and ISBN 1-58563-270-8. This release contains Version 1.0 of the ACE-2
corpus, created and distributed by the LDC to support the Automatic Content Extraction
(ACE) program. The objective of the ACE program is to develop extraction technology
to support automatic processing of source language data (in the form of natural text,
and as text derived from ASR and OCR). This includes classification, filtering, and
selection based on the language content of the source data, i.e., based on the meaning
conveyed by the data. Thus the ACE program requires the development of technologies
that automatically detect and characterize this meaning. The ACE research objectives
are viewed as the detection and characterization of Entities, Relations, and Events.
There are three main ACE tasks: Entity Detection and Tracking, Relation Detection
and Characterization, and Event Detection and Characterization. Annotations for the
ACE-2 corpus were produced by Linguistic Data Consortium to support the following
two research tasks: Entity Detection and Tracking (EDT) and Relation Detection and
Characterization (RDC). For information regarding the ACE program and ACE technology
evaluations administered by the National Institute of Standards and Technology (NIST),
please visit the NIST website. For information about ACE annotation and ongoing ACE
corpus development, including annotation guidelines, task definitions, annotation
tools and other project documentation, please visit the ACE Project page at the LDC.
*Data* This publication contains two sets of data: training and devtest. Each of these
sets is further divided by source: broadcast news, newspaper, and newswire. The training
contains data originally developed as training material for the February 2002 evaluation
and again for the September 2002 evaluation. The devtest contains data originally
developed as test data for the February 2002 evaluation and later used as devtest
data for the September 2002 evaluation. The broadcast and newswire source data is
drawn from a subset of the TDT2 Multilanguage Text Version 4.0 (LDC2001T57); this
has been supplemented with additional newspaper data from the Washington Post. A portion
of the training broadcast data was drawn from the 1997 English Broadcast News Transcripts
(HUB4) corpus (LDC98T28). All material comes from the first half of 1998. The sources
for the broadcast, newswire, and newspaper data are listed below.
Newswire:
  New York Times Newswire Service (NYT)
  Associated Press Worldstream Service (APW)
Broadcast News:
  Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
  American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
  Public Radio International, "The World" (PRI)
  Voice of America, English news programs (VOA)
  MSNBC, "The News With Brian Williams" (MNB)
  National Broadcasting Company, "Nightly News" (NBC)
Newspaper:
  Washington Post (WAP)
This publication includes the source data files in .sgm format, the annotation files
in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD,
which was used for the September 2002 ACE Evaluation.
There are 179,007 words of source data, or 519 files, broken down as follows:
Source   # Words train   # Words devtest   # Files train   # Files devtest
NYT      32892           7487              48              9
APW      29144           7037              82              20
CNN      2290            2653              69              11
ABC      1588            2687              24              10
PRI      1272            5284              43              9
VOA      594             2611              24              7
MNB      0               2539              0               6
NBC      0               2633              0               8
WAP      60247           15070             76              17
ea       2019            0                 31              0
ed       1094            0                 25              0
Total    131023          47984             422             97
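As a small worked check of the file counts in the breakdown above, the per-source figures can be tallied in a few lines. This is a sketch only; the dictionary layout and function name are illustrative choices, and the numbers are copied from the record.

```python
# Per-source file counts copied from the breakdown above: (train, devtest).
FILE_COUNTS = {
    "NYT": (48, 9), "APW": (82, 20), "CNN": (69, 11), "ABC": (24, 10),
    "PRI": (43, 9), "VOA": (24, 7), "MNB": (0, 6), "NBC": (0, 8),
    "WAP": (76, 17), "ea": (31, 0), "ed": (25, 0),
}


def split_totals(counts: dict) -> tuple:
    """Sum the train and devtest file counts over all sources."""
    train = sum(t for t, _ in counts.values())
    devtest = sum(d for _, d in counts.values())
    return train, devtest


if __name__ == "__main__":
    train_files, devtest_files = split_totals(FILE_COUNTS)
    print(train_files, devtest_files, train_files + devtest_files)  # 422 97 519
```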
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mitchell, Alexis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Davis, J.K.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grishman, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brunstein, Ada
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ferro, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632716
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 537-362-711-928-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Gigaword was produced by Linguistic Data Consortium (LDC) catalog number LDC2003T12
and ISBN 1-58563-271-6. This is a comprehensive archive of newswire text data that
has been acquired from Arabic news sources by the Linguistic Data Consortium (LDC)
at the University of Pennsylvania. Four distinct sources of Arabic newswire are represented
here:
  Agence France Presse (afa)
  Al Hayat News Agency (alh)
  An Nahar News Agency (ann)
  Xinhua News Agency (xin)
Much of the AFP content in this collection has been published
previously by the LDC in Arabic Newswire Part 1 (LDC2001T55) and some of this content
has also been included in an Arabic supplement to TDT3 and as the Arabic component
of TDT4. TDT4 also included a four month sample from Al Hayat and An Nahar (October
2000 - January 2001). Apart from that, all of the Al Hayat, An Nahar and Xinhua Arabic
content, as well as AFP content for 2001-2002, is being released here for the first
time. *Data* There are 319 files, totalling approximately 1.1GB in compressed form
(4348 MB uncompressed, and 391619 Kwords). The table below gives, for each source,
the number of files (#Files), the total compressed file size (Gzip-MB), the total
uncompressed file size (Totl-MB; approximately 4.3 gigabytes overall), the number
of space-separated tokens in the text, excluding SGML tags, in thousands (K-wrds),
and the number of DOCs (#DOCs).
Source   #Files   Gzip-MB   Totl-MB   K-wrds   #DOCs
AFA      104      274       1091      94484    516855
ALH      95       431       1535      139501   305250
ANN      96       415       1530      140247   327768
XIA      24       47        192       17387    106846
TOTAL    319      1167      4348      391619   1256719
All text files
in this corpus have been converted to UTF-8 character encoding. Owing to the use of
UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character
(ASCII) text, whereas lines of actual text data, including article headlines and datelines,
contain a mixture of single-byte and multi-byte characters. In general, single-byte
characters in the text data will consist of digits and punctuation marks (where the
original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation),
whereas multi-byte characters consist of Arabic letters and a small number of special
punctuation or other symbols. This variable-width character encoding is intrinsic
to UTF-8, and all UTF-8 capable processes will handle the data appropriately. Each
data file name consists of the three-letter prefix, followed by a six-digit date (representing
the year and month during which the file contents were generated by the respective
news source), followed by a ".gz" file extension, indicating that the file contents
have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each
file contains all the usable data received by LDC for the given month from the given
news source. All text data are presented in SGML form, using a very simple, minimal
markup structure. The corpus has been fully validated by a standard SGML parser utility
(nsgmls), using the DTD file provided in the publication. Unlike older corpora, the
present corpus uses only the information structure that is common to all sources and
serves a clear function: headline, dateline, and core news content (usually containing
paragraphs). All sources have received a uniform treatment in terms of quality control,
and have been categorized into three distinct "types":
  story: this type of DOC represents a coherent report on a particular topic or event,
    consisting of paragraphs and full sentences
  multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly
    describes a particular topic or event: "summaries of today's news," "news briefs
    in ... (some general area like finance or sports)," and so on
  other: these DOCs clearly do not fall into any of the above types; these are things
    like lists of sports scores, stock prices, temperatures around the world, and so on
The general
strategy for categorizing DOCs into these three classes was, for each source, to discover
the most common and frequent clues in the text stream that correlated with the "non-story"
types. When none of the known clues was in evidence, the DOC was classified as a "story."
Previous "Gigaword" corpora (in English and Chinese) had a fourth category, "advis"
(for "advisory"), which applied to DOCs that contain text intended solely for news
service editors, not the news-reading public. In preparing the Arabic data, the task
of determining patterns for assigning "non-story" type labels was carried out by a
native speaker of Arabic. For whatever reason, this person did not find the "advis"
category to be applicable to any of the data.
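The file naming scheme and the mixed single-byte and multi-byte text described in the summary above can be illustrated with a short sketch. Assumptions: the six-digit date is taken to be in YYYYMM order, and the file name "afa200101.gz" is a made-up example that follows the stated pattern rather than a name confirmed by this record.

```python
import re
from typing import NamedTuple


class CorpusFile(NamedTuple):
    source: str  # three-letter source prefix, e.g. "afa"
    year: int
    month: int


# Three letters, six digits (assumed YYYYMM), then the ".gz" extension.
_NAME = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


def parse_name(name: str) -> CorpusFile:
    """Split a data file name into source prefix, year, and month."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    month = int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return CorpusFile(match.group(1), int(match.group(2)), month)


def byte_widths(text: str) -> list[int]:
    """Return the number of UTF-8 bytes used by each character."""
    return [len(ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    print(parse_name("afa200101.gz"))
    # An ASCII digit takes one byte; the Arabic letter seen (U+0633) takes two.
    print(byte_widths("1\u0633"))
```

To read such a file, Python's standard `gzip.open(path, "rt", encoding="utf-8")` decompresses and decodes in one step, so the variable-width encoding needs no special handling.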
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632392
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 402-267-910-068-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Message Understanding Conference (MUC) 6
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Message Understanding Conference (MUC) 6 was produced by Linguistic Data Consortium
(LDC) catalog number LDC2003T13 and ISBN 1-58563-239-2. In the 1990s, the MUC evaluations
funded the development of metrics and statistical algorithms to support government
evaluations of emerging information extraction technologies. Additional information
from NIST can be found at http://www.itl.nist.gov/iaui/894.02/related_projects/muc.
*Data* This corpus contains the 318 annotated Wall Street Journal articles, the scoring
software and the corresponding documentation used in the MUC6 evaluation. Both the
MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate
the evaluation. All the materials are published as received from the corpus creators,
without any quality control being done at the LDC (the only difference is that the
files have been uncompressed).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval
- Form subdivision:
Congresses.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information storage and retrieval systems
- Form subdivision:
Congresses.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chinchor, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632732
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 034-299-958-433-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SLX Corpus of Classic Sociolinguistic Interviews
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The SLX Corpus of Classic Sociolinguistic Interviews comprises eight sociolinguistic
interviews with a total of nine speakers, conducted in the 1960s and 70s. All of the
interviews were conducted by William Labov or by one of his students. Labov notes that
these interviews are not classic in the sense that they form part of a systematic
sociolinguistic study of the speech community. What makes these interviews classic
is that they represent classic solutions to the problems of achieving cross-cultural
contact, reducing the effect of the Observer's Paradox and approximating the vernacular
of everyday life. Most importantly, they are interviews with extraordinarily gifted,
memorable and fluent speakers. These particular interviews have also been targeted
for inclusion in this corpus because of their sound quality and because publication
of the audio data and corresponding transcripts and annotations does not violate any
agreement the interviewer made with the speakers regarding data distribution. The
corpus includes the complete interview recordings plus time-aligned verbatim transcripts
for each speaker. Also included in the publication is a sociolinguistic variable survey
that represents an overview of the intra- and inter-speaker variation attested in
the corpus, highlighting a broad range of phonological, phonetic, grammatical, lexical
and stylistic variables. Finally, the publication includes a number of annotation
tools that allow users to listen to each interview while browsing the corresponding
transcripts, and to display and hear each token identified in the variable survey.
These tools can be extended to create new time-aligned transcripts or tag additional
variables within the existing corpus. The SLX Corpus was developed as part of the
Data and Annotations for Sociolinguistics (DASL) Project, an investigation of best
practices in the use of digital speech corpora for the study of language variation.
Containing classic interview material in the Labovian tradition, it is a valuable
teaching tool for linguists. The recordings demonstrate successful interviewing techniques,
the sound quality is high, and the digitization, segmentation and transcription of
the data represent best practice in these areas. The variable survey highlights over
150 sociolinguistic variables attested in the corpus and suggests avenues for further
research. Most importantly, the SLX Corpus provides both an example of a digital speech
corpus developed specifically to support sociolinguistic research, and a stable benchmark
for training in sociolinguistic data collection, digitization, segmentation, transcription,
analysis and publication. *Data* The 17 speech files are 22050Hz, 16-bit, single-channel
in the MS WAV (RIFF) format, for a total of 575 minutes (~ 1.5GB). The audio data
reflects a broad spectrum of speaking styles, including spontaneous speech, narratives,
responses and formal linguistic tasks. The interviews touch on a multitude of topics,
and corpus users should note that the language of the interviews represents the uncensored
opinions of the speakers, reflecting their daily concerns and personal histories.
Taken as a whole, the speakers exemplify a wide variety of regional and social dialects.
Demographic information for each main speaker in the corpus is displayed in the table
below.
Speaker | Age | Speech Community | Occupation | Ethnicity | Education
Adolphus H. | 81 | Near Hillsboro, NC | Farmer | African American | Very little
Bobbie A. | 22 | Ayr, Scotland | Saw Doctor | Scottish/Italian | Some technical college
Henry G. | 60 | E. Atlanta, GA (Dekalb Co.) | Railroad foreman | European American | High school graduate
Jerry T. | 19 | Near Leakey, Texas | Gas station attendant | European American | Some high school
Joe D. (interviewed with Eddie M.) | 21 | Liverpool, England | Docker | English | Some high school
Eddie M. (interviewed with Joe D.) | 19 | Liverpool, England | Docker | English | Some high school
Kathy D. | 15 | Rochester, NY | Student | European American | In 11th grade
Louise A. | 53 | Knoxville, TN | Mother | European American | Unknown
Rose B. | 43 | New York, NY (Lower East Side) | Factory seamstress | Italian American | Sixth Grade
The corpus also contains transcripts, annotations, annotation
tools and documentation. The documentation includes the complete segmentation and
transcription guidelines, descriptions of the variables and style codes used in the
variable survey, demographic information plus Labov's notes about each speaker, and
an instruction manual for using the corpus tools.
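The audio figures quoted in the summary above (22050Hz, 16-bit, single-channel, 575 minutes, about 1.5GB) are consistent with one another, as a quick calculation shows. The sketch assumes uncompressed PCM audio and ignores the small WAV header on each file; the constant and function names are illustrative.

```python
SAMPLE_RATE_HZ = 22050   # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # single-channel
TOTAL_MINUTES = 575


def pcm_bytes(minutes: int, rate: int, bytes_per_sample: int, channels: int) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return minutes * 60 * rate * bytes_per_sample * channels


if __name__ == "__main__":
    size = pcm_bytes(TOTAL_MINUTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
    print(size)                   # 1521450000
    print(round(size / 1e9, 2))   # 1.52
```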
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Sociolinguistics
- Form subdivision:
Interviews
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Interviews
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conn, Jeffrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Evans, Suzanne Wagner
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Labov, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632740
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 352-475-235-734-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SummBank 1.0 was produced by Linguistic Data Consortium (LDC) catalog number LDC2003T16
and ISBN 1-58563-274-0. SummBank 1.0 contains the data created for the Summer 2001
Johns Hopkins Workshop which focused on text summarization in a cross-lingual information
retrieval framework. For more information about the Johns Hopkins summer workshop
on Text Summarization, please visit its website. The goal is to gather a corpus of
original documents and summaries that can be used as gold standards by the document
summarization community. The source data consists of 18,147 aligned bilingual (Cantonese
and English) article pairs from the Information Services Department of the Hong Kong
Special Administrative Region of the People's Republic
of China, which were published by the LDC in 2000 as Hong Kong News Parallel Text.
*Data* This corpus contains 40 news clusters in English and Chinese, 360 multi-document,
human-written non-extractive summaries, and nearly two million single document and
multi-document extracts created by automatic and manual methods. The summarizer that
was reimplemented and upgraded during the workshop is called MEAD; updated versions
of the software are available from the MEAD website. This distribution includes roughly
two million text files, totalling approximately 13GB uncompressed. The text files
are encoded as UTF-8 for English and as GB or Big-5 for Chinese.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Radev, Dragomir
ADDED ENTRY--PERSONAL NAME
- Personal name:
Teufel, Simone
ADDED ENTRY--PERSONAL NAME
- Personal name:
Saggion, Horacio
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Blitzer, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Celebi, Arda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Drabek, Elliott
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liu, Danyu
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Allison, Tim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632759
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 484-381-943-904-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Chinese (MTC) Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Chinese (MTC) Part 2 was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003T17, ISBN 1-58563-275-9. To support the development
of automatic means for evaluating translation quality, the LDC was sponsored to solicit
four sets of human translations for a single set of Mandarin Chinese source materials.
The LDC was also asked to produce translations from various commercial off-the-shelf
(COTS) systems, including commercial Machine Translation (MT) systems as well as MT systems
available on the Internet. There are a total of six sets of COTS outputs, and one
set of outputs from a TIDES MT Evaluation participant, which is representative of
state-of-the-art research systems. To see if automatic evaluation systems, such
as BLEU, track human assessment, the LDC has also performed human assessment on two
of the six COTS outputs and the TIDES research system. The corpus includes the assessment
results for these two COTS systems, the assessment result for the TIDES research system,
and the specifications used for conducting the assessments. A similar corpus, Multiple-Translation
Chinese Corpus, was published in 2002. Both the 2002 and the present corpus used Chinese
news articles from the Xinhua and Zaobao News Service, and provide human and COTS
translations. However, Part 2 also offers translations from a TIDES research system,
and provides human assessment of some of the automatic translations. *Data* Source
Data Selection: Two sources of journalistic Mandarin Chinese text were selected to
provide the Chinese material: - Xinhua News Service: 70 news stories - Zaobao News
Service: 30 news stories (total: 100 stories). The Xinhua data were drawn from the March
and April 2002 collection of Xinhua news. The Zaobao data were drawn from the March 2002
collection of Zaobao's online news service. The story selection from the two newswire
collections was controlled by story length: all selected stories contain between about
212 and 707 Chinese characters. The overall count of Chinese characters by source
is shown in the following table:
  Source   Characters
  Xinhua   25,247
  Zaobao   14,009
  total    39,256
Zaobao is a news portal from Singapore and many of its news stories are translations
from other news agencies' releases. For the Chinese data, there are approximately
20K words, while for the English translations, there are approximately 258K words in
total, and 13K unique words. Source Data Preparation for Human Translation: The original
source files used GB-2312 encoding for the Chinese characters, and SGML tags for marking
sentence and paragraph boundaries and other information about each story. The character
encoding has been left unaltered. To make things easier for translators, nearly all
SGML tags were removed, or replaced by "plain text" markers. Human Translation Procedure
and Quality Assessment: The four best translation teams were chosen from the 11 teams which
had participated in the translation of Multiple Translation Chinese Corpus Part 1
(LDC2002T01) to take part in the project. In accordance with the guidelines, each
translation team was asked to return the first 10 Xinhua stories for quality checking.
This was to ensure that the translation team had indeed understood and was following
the guidelines and that the translation quality was acceptable. The LDC sent the translations
back to the translation team when any deviations from the guidelines or quality issues
were detected. Subsequent translation submissions were continuously monitored for conformance
and quality. Once the full set of translations was complete, a final pass of reformatting
and validation was carried out, to assure alignability of segments, and to convert
the translated texts into SGML format. Each translation team was also asked to fill
out and return a questionnaire to describe their procedures and professional background.
Machine Translation Procedure: Complete sets of automatic MT translations were also
produced by submitting the 100 stories to each of six publicly-available MT systems.
Four of these were commercial MT software packages (off-the-shelf products), and two
were free web-based services. Starting from the original SGML text format, special
alterations were made to the files on an as-needed basis, so that they would be accepted
and handled correctly by the various systems; also, the systems differed in terms
of the input and retrieval methods required to submit the source data for translation
and to save the translated text in alignable form. Human Assessment Procedure: The
goal of this effort is to evaluate the quality of TIDES research, human translation
teams and commercial off-the-shelf (COTS) systems. Translations are evaluated on the
basis of adequacy and fluency. Adequacy refers to the degree to which the translation
communicates information present in the original source language text. Fluency refers
to the degree to which the translation is well-formed according to the grammar of
the target language.
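Automatic metrics such as BLEU rely on having several reference translations per source segment, which is what this corpus supplies. The sketch below shows only the clipped (modified) n-gram precision component against multiple references. It omits the brevity penalty and the geometric mean over n-gram orders, so it is not a full BLEU implementation, and it is not the LDC's or TIDES's evaluation tooling. The function names and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Fraction of candidate n-grams matched in the references, with clipped counts."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in one reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
cand = "the the the the the the the".split()
print(clipped_precision(cand, refs, 1))  # 2 of 7 unigrams are credited
```

Clipping is what stops a degenerate candidate that repeats a common word from scoring well, which is why additional human references (four sets in this corpus) make the measure more forgiving of legitimate variation without rewarding repetition.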
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632767
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 610-045-411-801-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Arabic (MTA) Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Arabic (MTA) Part 1 was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003T18, ISBN 1-58563-276-7. To support the development
of automatic means for evaluating translation quality, the LDC was sponsored to solicit
ten sets of human translations for a single set of Arabic source materials. The LDC
was also asked to produce translations from various commercial off-the-shelf
(COTS) systems, including commercial Machine Translation (MT) systems as well as MT systems
available on the Internet. There are a total of two sets of COTS outputs and one
output set from a TIDES 2002 MT Evaluation participant, which is representative of
state-of-the-art research systems. To see if automatic evaluation systems such
as BLEU track human assessment, the LDC has also performed human assessment on the
two COTS outputs and the TIDES research system. The corpus includes the assessment
results for one of the two COTS systems, the assessment results for the TIDES research
system, and the specifications used for conducting the assessments. *Data* Source
Data Selection: Two sources of journalistic Arabic text were selected to provide the
Arabic material: - Xinhua News Service: 66 news stories (files: artb_500 - artb_565)
- AFP News Service: 75 news stories (files: artb_S01 - artb_S06, artb_001 - artb_069)
(total: 141 stories). There are 141 source files and 1,792 translation files (12 of
the 13 systems produced translations for all 141 source files, while one system produced
translations for only 100 of the 141 Arabic stories). The Xinhua data was drawn from
the Xinhua News Agency's Arabic newswire feed in October 2001. The AFP data was drawn
from the LDC's Arabic Newswire Part 1. The story selection from the two newswire
collections was controlled by story length: all selected stories contain between 700
and 1,500 Arabic characters. The overall count of Arabic words (excluding markup)
is shown in the following table by source:
  Source   Words
  AFP      12,674
  Xinhua   11,155
  total    23,829
For the Arabic data, there are approximately 23K words, while for the English
translations, there are 366K words in total and 163K unique words. Source Data Preparation
for Human Translation: The original source files used CP-1256 encoding for the Arabic
characters, and SGML tags for marking sentence and paragraph boundaries and other
information about each story. The source files were later converted to UTF-8 encoding.
To make things easier for the translators, nearly all SGML tags were removed or replaced
by "plain text" markers. Human Translation Procedure and Quality Assessment: Each initially
selected translation team received the translation guidelines and a sample pair of
source and translation (excluded from the final release) for review. After the team
said that they understood the task requirements and would be willing to participate
in the project, 75 AFP news stories were sent to them as a first installment of data.
In accordance with the guidelines, each translation team was asked to return the first
six AFP stories for quality checking. This was to ensure that the translation team
had indeed understood and was following the guidelines and that the translation quality
was acceptable. The LDC sent the translations back to the translation team when any
deviations from the guidelines or quality issues were detected. Subsequent translation
submissions were continuously monitored for conformance and quality. Once the full
set of translations was complete, a final pass of reformatting and validation was
carried out, to assure alignability of segments, and to convert the translated texts
into SGML format. Each translation team was also asked to fill out and return a questionnaire
to describe their procedures and professional background. Machine Translation Procedure:
Complete sets of automatic MT translations were also produced by submitting the 141
stories to each of the two publicly-available MT systems. Starting from the original
SGML text format, special alterations were made to the files on an as-needed basis,
so that they would be accepted and handled correctly by the various systems. Also,
the systems differed in terms of the input and retrieval methods required to submit
the source data for translation and to save the translated text in alignable form.
Human Assessment Procedure: The goal of this effort is to evaluate the quality of TIDES
research, human translation teams and commercial off-the-shelf (COTS) systems. Translations
are evaluated on the basis of adequacy and fluency. Adequacy refers to the degree
to which the translation communicates information present in the original source language
text. Fluency refers to the degree to which the translation is well-formed according
to the grammar of the target language.
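The story, system, and file counts quoted in the summary are mutually consistent, which a short calculation makes explicit. This sketch assumes one translation file per system per story, and it assumes that the 13 systems are the ten human teams, the two COTS systems, and the one TIDES research system. The record implies both points but does not state them outright, and the function name is illustrative.

```python
def expected_translation_files(stories: int, full_systems: int, partial_stories: int) -> int:
    """Files produced when full_systems translate every story and one more system translates only some."""
    return full_systems * stories + partial_stories


XINHUA_STORIES, AFP_STORIES = 66, 75
HUMAN_TEAMS, COTS_SYSTEMS, RESEARCH_SYSTEMS = 10, 2, 1  # assumed breakdown of the 13 systems

total_stories = XINHUA_STORIES + AFP_STORIES
total_systems = HUMAN_TEAMS + COTS_SYSTEMS + RESEARCH_SYSTEMS
# Twelve systems covered all stories; one covered only 100 of them.
translation_files = expected_translation_files(total_stories, total_systems - 1, 100)

print(total_stories, total_systems, translation_files)  # 141 13 1792
```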
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bamba, Moussa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632694
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 685-159-396-611-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
FORM2 Kinematic Gesture
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
FORM2 Kinematic Gesture was produced by the Linguistic Data Consortium. FORM is a
gesture annotation scheme designed to capture the kinematic information in gesture
from videos of speakers. This publication is a detailed database of gesture-annotated
videos stored in the Anvil and FORM file formats. FORM encodes the "phonetics" of
gesture by giving geometric descriptions of location and movement of the right and
left arms. Other kinematic information such as effort and shape are also recorded.
*Data* There are a total of 24 data files: eight movie files, eight Anvil files, and
eight FORM files. The movie files represent 12 minutes of audio and video recordings
excerpted from a lecture given by Brian MacWhinney on January 24, 2000 at Carnegie
Mellon University. These video recordings were chosen because they are part of the
NSF-funded TalkBank project. The video format is as follows:
  Size          360 x 240 pixels
  Compression   H.261
  Data rate     696 K/sec
  Video rate    29.82 fps
  Audio rate    48.000 kHz
  Audio format  8-bit stereo
The gesture annotations were created using the FORM 2.0 tag set.
The Anvil annotation files used in their creation, augmented with FORM 1.0 data, are
also included. (FORM1 data will be the subject of a separate publication to be released
in the near future.) FORM1 values that are not included in the FORM2 spec are not
included in the publication. A full description of the FORM tag set with explanations
of each value can be found in the documentation. *Sponsorship* This research was conducted
using funding from the following grant sources: ISLE - 9910603 NSF: TalkBank (via
subcontract from Carnegie Mellon University) - BCS-998009 and BCS-9978056 NSF: Discourse
and Gesture - EIA98-09209
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Gesture
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martell, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Howard, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Osborn, Chris
ADDED ENTRY--PERSONAL NAME
- Personal name:
Britt, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Myers, Kari
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u kor d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 031-806-130-080-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Klex: Finite-State Lexical Transducer for Korean
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Klex: Finite-State Lexical Transducer for Korean was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2004L01, ISBN 1-58563-283-X. Klex is a finite-state lexical
transducer for the Korean language, with the lexical string on the upper side and
the inflected surface string on the lower side. Klex was developed on the XFST (Xerox
Finite State Tool) software platform, developed and distributed by the Xerox Corporation.
The most common application for such lexical transducers is morphological analysis
and generation. *Data* The distribution consists of ~7.8 MB. Characters in Hangul (Korean
alphabet) can be displayed by selecting Korean encoding in your browser. A lexicon
in the form of a transducer has the following basic structure:
  upper:  fly/VV+s/ECS    돕/VV+었/EPF+다/EFN
               |                  |
  lower:     flies              도왔다
A sequence of morphemes along with the respective part-of-speech constitutes
the upper string; a fully lexicalized form constitutes the lower string. A transducer
network as a whole consists of all such possible morpheme sequence / word pairs in
the language. Given the lower lexicalized form, the transducer can produce the analyzed
morpheme sequence (the process of "looking-up"); conversely, the transducer can be
used in producing the fully inflected surface form of a grammatical sequence of morphemes
(opposite of "looking-up," hence Xerox's terminology of "looking-down"). These two
operations are the most typical applications of such lexical transducers, namely morphological
analysis and generation. Output of Klex when used as a morphological analyzer is compatible
with the Morphologically Annotated Korean Text corpus. It also conforms to the Korean
Treebank POS annotation standards, with slight variation. The Korean morphological
grammar employed by Klex was constructed by Na-Rae Han, under the guidance of Ken
Beesley, Lauri Karttunen and Martha Palmer. The lexicon was fine-tuned by testing
against various corpora, by fixing undesirable outputs and adding missing lexical
entries. Klex was partially supported by the Korean Treebank Project, whose result
was published in 2002 as the Korean English Treebank Annotations.
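The two directions of use described above, "looking-up" and "looking-down", can be illustrated with a toy stand-in for the transducer. This is only a sketch: a real finite-state transducer encodes the relation compactly as a network, whereas the hypothetical `ToyLexicalRelation` class below simply stores pairs in two dictionaries. It does not use XFST or any Klex file format, and the only data are the two example pairs quoted in the record.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple


class ToyLexicalRelation:
    """Stores (upper, lower) string pairs and answers queries in both directions."""

    def __init__(self, pairs: Iterable[Tuple[str, str]]) -> None:
        self._by_lower: DefaultDict[str, List[str]] = defaultdict(list)
        self._by_upper: DefaultDict[str, List[str]] = defaultdict(list)
        for upper, lower in pairs:
            self._by_lower[lower].append(upper)
            self._by_upper[upper].append(lower)

    def look_up(self, surface: str) -> List[str]:
        """Analysis: surface form -> morpheme sequences with part-of-speech tags."""
        return list(self._by_lower.get(surface, []))

    def look_down(self, analysis: str) -> List[str]:
        """Generation: morpheme sequence -> inflected surface forms."""
        return list(self._by_upper.get(analysis, []))


relation = ToyLexicalRelation([
    ("fly/VV+s/ECS", "flies"),
    ("돕/VV+었/EPF+다/EFN", "도왔다"),
])
print(relation.look_up("도왔다"))
print(relation.look_down("fly/VV+s/ECS"))
```

Both methods return lists because the relation is not one-to-one in general: a surface form may have several analyses, and an unknown form yields an empty list.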
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633488
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 659-853-066-274-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Santa Barbara Corpus of Spoken American English Part IV
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Santa Barbara Corpus of Spoken American English Part IV was produced by the Linguistic
Data Consortium (LDC), catalog number LDC2005S25, ISBN 1-58563-348-8. Santa Barbara
Corpus of Spoken American English Part IV is based on hundreds of recordings of natural
speech from all over the United States, representing a wide variety of people of different
regional origins, ages, occupations, and ethnic and social backgrounds. It reflects
many ways that people use language in their lives: conversation, gossip, arguments,
on-the-job talk, card games, city council meetings, sales pitches, classroom lectures,
political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected
by: University of California, Santa Barbara Center for the Study of Discourse (Director:
John W. Du Bois (UCSB), Authors: John W. Du Bois and Robert Englebretson. Associate
Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson
(UCSB)). For software and additional data resources, please refer to the following
sites: TalkBank, International Corpus of English. Part I of the Santa Barbara Corpus
of Spoken American English is available as LDC2000S85. Part II of the Santa Barbara
Corpus of Spoken American English is available as LDC2003S06. Part III of the Santa
Barbara Corpus of Spoken American English is available as LDC2003S10. *Data* The audio
data consists of 14 WAV-format speech files, recorded in two-channel PCM at 22,050 Hz.
The speech files total 5.75 hours of audio (1.5 GB), representing over 58,000 words
and over 6,000 unique words in the transcribed text.
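The audio parameters stated above (two-channel PCM at 22,050 Hz) can be checked against a file header with Python's standard `wave` module. This is a minimal sketch: the function names and the idea of validating files this way are illustrative assumptions, not part of the corpus documentation, and the record does not state the sample width, so it is reported but not checked.

```python
import wave
from typing import Dict, Union


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Read basic format parameters from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def matches_record(info: Dict[str, Union[int, float]]) -> bool:
    """True if the file is two-channel audio sampled at 22,050 Hz, as the record states."""
    return info["channels"] == 2 and info["sample_rate_hz"] == 22050
```

Summing `duration_s` over all 14 files would give a figure to compare with the 5.75 hours quoted in the summary.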
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Du Bois, John W.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Englebretson, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634441
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 123-270-985-345-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release is Part 1 of the three-part GALE Phase 1 Arabic Broadcast News Parallel
Text, which, along with other corpora, was used as training data in year 1 (Phase
1) of the DARPA-funded GALE program. This corpus contains transcripts and English
translations of 17 hours of Arabic broadcast news programming selected from a variety
of sources. This corpus does not contain the audio files from which the transcripts
and translations were generated. The audio files will be released by the LDC at a
future date. LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text
data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data* A total of 17 hours of Arabic broadcast news recordings was selected
from six sources and seven different programs. A manual selection procedure was used
to choose data appropriate for the GALE program, namely, news and conversation programs
focusing on current events. Stories on topics such as sports, entertainment news,
and stock market reports were excluded from the data set. The following table is a
summary of the files included in this release.
  Source              Program         Epoch (YYYY.MM)    #hours  #words
  Al Hurra            News 10         2005.11            0.2     959
                      News 13         2005.04 - 2005.11  3.9     24,430
  Dubai TV            Dubai News      2005.01 - 2005.02  1.9     10,842
  Lebanese Broadcast  Naharkum Saiid  2005.01 - 2005.02  2.0     13,979
  Nile TV             News            2000.10            0.6     3,671
  Voice of America    News            2000.06 - 2000.11  5.7     36,925
*Transcription* The selected audio snippets were then carefully
transcribed by LDC annotators and professional transcription agencies following LDC's
Quick Rich Transcription. Manual sentence units/segments (SU) annotation was also
performed as part of the transcription task. Three types of end-of-sentence SU are
identified: statement SU, question SU, and incomplete SU. *Translation* After transcription
and SU annotation, the files were reformatted into a human-readable translation format
and were then assigned to professional translators for careful translation. Translators
followed LDC's GALE translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best practices for translating
certain linguistic features (such as names and speech disfluencies), and quality control
procedures applied to completed translations. *TDF Format* All final data are in Tab
Delimited Format (TDF). TDF is compatible with other transcription formats, such as
the Transcriber format and AG format, and it is easy to process. Each line of a TDF
file corresponds to a speech segment and contains 13 tab-delimited fields (the 13th
field, "suType", might be empty):
  field               data_type
  -----               ---------
  1  file             unicode
  2  channel          int
  3  start            float
  4  end              float
  5  speaker          unicode
  6  speakerType      unicode
  7  speakerDialect   unicode
  8  transcript       unicode
  9  section          int
  10 turn             int
  11 segment          int
  12 sectionType      unicode
  13 suType           unicode
A source TDF file and its translation are the same except
that the transcript in the source TDF is replaced by its English translation. *Encoding*
All data are encoded in UTF-8.
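The 13-field layout described above can be read with a few lines of code. The following is a minimal sketch, not LDC's own tooling: the class name TdfSegment, the function names, and the snake_case attribute names are assumptions made for illustration. It assumes every non-blank line is a data line; any header or comment lines a real release may carry would need to be filtered out first.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TdfSegment:
    """One speech segment, mirroring the 13 fields listed in the summary."""
    file: str
    channel: int
    start: float
    end: float
    speaker: str
    speaker_type: str
    speaker_dialect: str
    transcript: str
    section: int
    turn: int
    segment: int
    section_type: str
    su_type: str  # the 13th field; may be empty


def parse_tdf_line(line: str) -> TdfSegment:
    """Split one tab-delimited line and convert the int/float columns."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) == 12:
        # Tolerate a trailing empty suType whose tab was stripped upstream.
        cols.append("")
    if len(cols) != 13:
        raise ValueError(f"expected 13 tab-delimited fields, got {len(cols)}")
    return TdfSegment(
        file=cols[0],
        channel=int(cols[1]),
        start=float(cols[2]),
        end=float(cols[3]),
        speaker=cols[4],
        speaker_type=cols[5],
        speaker_dialect=cols[6],
        transcript=cols[7],
        section=int(cols[8]),
        turn=int(cols[9]),
        segment=int(cols[10]),
        section_type=cols[11],
        su_type=cols[12],
    )


def read_tdf(lines: Iterable[str]) -> Iterator[TdfSegment]:
    """Yield one TdfSegment per non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_tdf_line(line)
```

Because the data are UTF-8, a file would be opened with open(path, encoding="utf-8") before passing it to read_tdf. Since a source file and its translation differ only in the transcript field, the two can be paired segment by segment on the remaining fields.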
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633666
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 024-464-884-415-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Stories v 1.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on CSLU: Stories V1.2, Linguistic Data Consortium
(LDC) catalog number LDC2006S14 and ISBN 1-58563-366-6. CSLU: Stories contains extemporaneous
speech collected from English speakers in the CSLU Multilanguage Telephone Speech
data collection. Each speaker was asked to speak on a topic of his or her choice for
one minute. Those utterances are collected in the Stories corpus. *Data* The Stories
corpus comprises:
* Speech files for the 702 calls
* Time-aligned word-level transcriptions (and corresponding comment files) for approximately 322 stories
* Word transcriptions (not time-aligned) for 702 stories
* Time-aligned phonetic labels for 702 stories
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oshika, Beatrice
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633240
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004L02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 694-194-540-336-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Buckwalter Arabic Morphological Analyzer Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004L02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Buckwalter Arabic Morphological Analyzer Version
2.0, Linguistic Data Consortium (LDC) catalog number LDC2004L02 and ISBN 1-58563-324-0.
*Data* The data consists primarily of three Arabic-English lexicon files: prefixes
(299 entries), suffixes (618 entries), and stems (82158 entries representing 38600
lemmas). The lexicons are supplemented by three morphological compatibility tables
used for controlling prefix-stem combinations (1648 entries), stem-suffix combinations
(1285 entries), and prefix-suffix combinations (598 entries). The actual code for
morphology analysis and POS tagging is contained in a Perl script. The documentation
consists of a readme file with a description of the lexicon files, the morphological
compatibility tables, the morphology analysis algorithm, a summary of stem morphological
categories, and a table with the author's Arabic transliteration system.
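The way three lexicons and three compatibility tables can work together is easier to see in a small example. The sketch below is an illustration only, written in Python rather than taken from the Perl script mentioned above. Every lexicon entry, category name, and table pair in it is a hypothetical toy value, not data from the release. It tries each split of a transliterated word into prefix, stem, and suffix, and keeps a split only if all three category pairs are listed as compatible.

```python
from typing import List, Tuple

# Toy lexicons: surface form -> list of morphological categories (all hypothetical).
PREFIXES = {"": ["Pref-0"], "Al": ["NPref"]}
STEMS = {"ktAb": ["Noun"], "ktb": ["Verb"]}
SUFFIXES = {"": ["Suff-0"], "At": ["NSuff"], "t": ["VSuff"]}

# Toy compatibility tables (all hypothetical).
PREFIX_STEM = {("Pref-0", "Noun"), ("NPref", "Noun"), ("Pref-0", "Verb")}
PREFIX_SUFFIX = {
    ("Pref-0", "Suff-0"), ("Pref-0", "NSuff"), ("Pref-0", "VSuff"),
    ("NPref", "Suff-0"), ("NPref", "NSuff"),
}
STEM_SUFFIX = {
    ("Noun", "Suff-0"), ("Noun", "NSuff"),
    ("Verb", "Suff-0"), ("Verb", "VSuff"),
}

Analysis = Tuple[str, str, str, str, str, str]


def analyze(word: str) -> List[Analysis]:
    """Return every (prefix, stem, suffix, categories...) split that passes all checks."""
    results: List[Analysis] = []
    for i in range(len(word)):                 # prefix is word[:i], may be empty
        for j in range(i + 1, len(word) + 1):  # stem is word[i:j], never empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for p_cat in PREFIXES.get(prefix, []):
                for s_cat in STEMS.get(stem, []):
                    for x_cat in SUFFIXES.get(suffix, []):
                        if ((p_cat, s_cat) in PREFIX_STEM
                                and (p_cat, x_cat) in PREFIX_SUFFIX
                                and (s_cat, x_cat) in STEM_SUFFIX):
                            results.append(
                                (prefix, stem, suffix, p_cat, s_cat, x_cat)
                            )
    return results
```

With these toy tables, "AlktAb" yields one analysis (prefix "Al", stem "ktAb", empty suffix), while "Alktbt" yields none because the hypothetical noun prefix is not compatible with the verb stem.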
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004L02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633704
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 942-053-729-014-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Prague Dependency Treebank 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech text
with complex and interlinked morphological (two million words), syntactic (1.5 MW)
and complex semantic annotation (0.8 MW). In addition, certain properties of sentence
information structure and coreference relations are annotated at the semantic level.
PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted to
current computational linguistics research needs. The corpus itself uses the latest
annotation technology. Software tools for corpus search, annotation and language analysis
are included. Extensive documentation (in English) is provided as well.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Morphology
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Syntax
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Semantics
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Panevová, Jarmila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajičová, Eva
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sgall, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pajas, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Štěpánek, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Havelka, Jiří
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mikulová, Marie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Žabokrtský, Zdeněk
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ševčíková-Razímová, Magda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Urešová, Zdeňka
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632805
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 767-326-222-377-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Czech Broadcast News Speech contains audio recordings collected from three Czech radio
stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2, Cesky
rozhlas 3 Vltava - CRo3) and two TV channels (Ceska televize - CTV and Prima TV -
Prima). The audio was recorded between February 1 and April 22, 2000, at the Department
of Cybernetics, University of West Bohemia in Pilsen. The corpus was created to support
the development of large vocabulary speaker independent speech recognition systems
for Czech. *Data* There are 286 audio files, totaling approximately 50 hours of broadcast
news. The news does not contain weather forecasts, sports news, or traffic announcements.
The audio files are single-channel, 22.05 kHz, 16-bit linear wav files. The stations,
channels, number of files and number of hours are listed below:
  Radio Source  Files  Hours
  CRo1          138    30.8
  CRo2          90     7.8
  CRo3          14     2
  TV Source     Files  Hours
  CTV           22     5.1
  Prima         22     4.2
The corresponding transcripts are available as Czech Broadcast News Transcripts.
The transcripts were created by native Czech speakers working at the Department of
Cybernetics, University of West Bohemia in Pilsen, under the direction of Vlasta Radova.
The transcription was done using software provided by the LDC (Transcriber 1.4.1).
Those parts of the audio recordings that do not contain speech or where the signal
was disrupted were not transcribed. As a consequence, the corpus contains about 23
hours of transcribed speech. The transcriptions are provided in both the ISO-8859-2
and Windows-1250 character sets.
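A quick way to confirm that a downloaded file matches the stated audio format (single channel, 22.05 kHz, 16-bit linear) is to read its WAV header. The sketch below uses only Python's standard wave module. The function names and the EXPECTED dictionary are assumptions made for illustration, not part of the corpus documentation.

```python
import wave

# Format stated in the summary: single-channel, 22.05 kHz, 16-bit linear PCM.
EXPECTED = {"channels": 1, "sample_rate": 22050, "sample_width_bytes": 2}


def describe_wav(path: str) -> dict:
    """Read the WAV header and return its basic parameters plus duration."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_expected(info: dict) -> bool:
    """True when channels, sample rate and sample width all match EXPECTED."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Summing duration_s over all files from one source would give a figure to compare against the hours listed in the table above.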
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Radova, Vlasta
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muller, Ludek
ADDED ENTRY--PERSONAL NAME
- Personal name:
Byrne, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, J.V.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ircing, Pavel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matousek, Jindrich
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632856
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 723-437-529-684-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ICSI Meeting Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC) under catalog number
LDC2004S02 and ISBN 1-58563-285-6. The ICSI Meeting corpus is a collection of 75 meetings
collected at the International Computer Science Institute in Berkeley during the years
2000-2002. The meetings included are "natural" meetings in the sense that they would
have occurred anyway: they are generally regular weekly meetings of various ICSI working
teams, including the team working on the ICSI Meeting Project. In recording meetings
of this type, we hoped to capture meeting dynamics and speaking styles that are as
natural as possible given that speakers are wearing close-talking microphones and
are fully cognizant of the recording process. The speech files range in length from
17 to 103 minutes, but generally run just under an hour each. Word-level orthographic
transcriptions are available as ICSI Meeting Transcripts. *Data* The collection includes
922 speech files, for a total of approximately 72 hours of Meeting Room speech. The
speech is structured as one subdirectory per meeting, containing wavefiles for each
channel (and possibly a .blp file, specifying any censored intervals). The audio was
collected at a 48 kHz sample rate, downsampled on the fly to 16 kHz. Audio files for
each meeting are provided as separate time-synchronous recordings for each channel,
encoded as 16-bit linear (big-endian) wavefiles, shorten-compressed in NIST SPHERE
format. The meetings were simultaneously recorded using close-talking microphones
for each speaker (generally head-mounted, but early meetings contain some lapel microphones),
as well as six table-top microphones: four high-quality omnidirectional PZM microphones
arrayed down the center of the conference table, and two inexpensive microphone elements
mounted on a mock PDA. All meetings were recorded in the same instrumented meeting
room. In addition to recording the meetings themselves, the participants were also
asked to read digit strings, similar to those found in TIDIGITS, at the start or end
of the meeting. This small-vocabulary read-speech component of the recordings -- using
the same meeting room, speakers, and microphones -- provides a valuable supplement
to the natural conversational data, allowing a factorization of the speech challenges
offered by the corpus. For all but a dozen of the meetings included in the corpus,
at least some of the participants read digit strings; for the great majority of meetings,
all participants did. The digit readings are included as part of the wavefiles for
the meeting as a whole and are fully transcribed as part of the associated transcripts.
There are a total of 53 unique speakers in the corpus. Meetings involved anywhere
from three to 10 participants, averaging six. The corpus contains a significant proportion
of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.
*Sponsorship* The collection and preparation of this corpus was made possible in large
part through funding from DARPA, both through the Communicator project and through
a ROAR "seedling," the Swiss IM2 project (National Centre of Competence in Research,
sponsored by the Swiss National Science Foundation), and a supplementary award from
IBM.
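Since the audio is distributed in NIST SPHERE format, it can help to inspect a file's text header before decoding. The sketch below parses a SPHERE header as the format is commonly documented: a "NIST_1A" line, a header-size line, then "name -type value" lines ending with "end_head". The function name is an assumption, and the field names and values shown in the usage note are typical examples, not values copied from this corpus. The sketch does not decompress shorten-encoded samples, which needs a separate tool.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line")
    # The second 8-byte line holds the total header size in bytes.
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, HeaderValue] = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, ftype, value = line.split(" ", 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        elif ftype.startswith("-s"):
            # "-sN" declares a string of N characters.
            fields[name] = value[: int(ftype[2:])]
        else:
            raise ValueError(f"unknown field type {ftype!r} for {name!r}")
    return fields
```

For a file matching the description above, one would expect fields along the lines of sample_rate 16000 and sample_n_bytes 2, with sample_byte_format "10" conventionally indicating big-endian samples and a sample_coding string that mentions shorten. The first kilobyte read with open(path, "rb").read(1024) is usually enough input, assuming the common 1024-byte header size.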
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computer science.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Janin, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Edwards, Jane
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Dan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gelbart, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, Nelson
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peskin, Barbara
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pfau, Thilo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stolcke, Andreas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wooters, Chuck
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632937
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 575-386-034-582-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2002 NIST Speaker Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 2002 NIST Speaker Recognition Evaluation corpus was produced by the Linguistic Data
Consortium (LDC) under catalog number LDC2004S04 and ISBN 1-58563-293-7. The 2002 NIST Speaker
Recognition Evaluation is part of an ongoing series of yearly evaluations conducted
by NIST. These evaluations provide an important contribution to the direction of research
efforts and the calibration of technical capabilities. They are intended to be of
interest to all researchers working on the general problem of text-independent speaker
recognition. To this end the evaluation was designed to be simple, to focus on core
technology issues, to be fully supported, and to be accessible. The 2002 NIST Speaker
Recognition Evaluation main data was extracted from the Switchboard Cellular Part
2. The extended data task used two phases of Switchboard II, Phases 2 and 3. This
evaluation also included the first multi-modal task, using data from the FBI voice
database. Supporting documentation for this evaluation may be found on the 2002 NIST
Speaker Recognition Evaluation website. Please consult the NIST evaluation plan for
detailed instructions on using this evaluation material. *Data* There are a total
of 9,153 speech files (6,098 at 8 kHz and 3,055 at 16 kHz), all of which are in SPHERE
format, for a total of ~156 hours. The data was initially distributed by NIST on 13
CD-ROMs (r81_1_1 through r81_13_1). This corpus consists of training and test data
and replicates exactly the content and structure of the 13 CD-ROMs.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632945
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 459-840-211-562-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ISL Meeting Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ISL Meeting Speech Part 1 was produced by the Linguistic Data Consortium (LDC) under catalog
number LDC2004S05 and ISBN 1-58563-294-5. ISL Meeting Speech Part 1 is a first
subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected
at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh,
PA during the years 2000-2001. The recorded meetings were either natural meetings
where participants needed to meet in the real world, or artificial meetings, which
were designed explicitly for the purposes of data collection but still had real topics
and tasks. The duration of the meetings in this corpus ranges from eight to 64 minutes
and averages 34 minutes. Word-level orthographic transcriptions are available as
ISL Meeting Transcripts Part 1. *Data* The collection includes 105 speech files, for a total of approximately
10 hours of meeting speech. The speech for each meeting consists of wave files for
each channel and a wave file containing a mix of all channels. The audio was collected
at a 16 kHz sample-rate. Audio files for each meeting are provided as separate time-synchronous
recordings for each channel, encoded as 16-bit (little-endian) wave files. During
meeting recordings, each speaker wore an individual lapel microphone and was recorded
via an Alesis 8-channel mix board and an ECHO Layla 8-channel sound card. This setup
was designed to obtain a consumer- or application-style sound quality. All meetings
were recorded in the same instrumented meeting area. There are a total of 31 unique
speakers in the corpus. Meetings involved
anywhere from three to nine participants, averaging five. The corpus contains a
significant proportion of non-native English speakers, varying in fluency. *Sponsorship*
The collection and preparation of this corpus was made possible in large part through
funding from DARPA, both through the GENOA project and through ROAR.
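The summary above states that each channel is a 16 kHz, 16-bit wave file. A minimal Python sketch for checking a file against that description follows; it uses only the standard-library wave module, and the function names describe_wav and matches_isl_description are illustrative, not part of the corpus or any LDC tool.

```python
import wave


def describe_wav(path):
    """Return basic format facts for a PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def matches_isl_description(info):
    """True if the file has the 16 kHz, 16-bit format stated in the summary."""
    return info["sample_rate_hz"] == 16000 and info["bits_per_sample"] == 16
```

The wave module reads uncompressed PCM only, which is sufficient for the format described here.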
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech synthesis
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Burger, Susanne
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacLaren, Victoria
ADDED ENTRY--PERSONAL NAME
- Personal name:
Waibel, Alex
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 047-363-770-147-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard Cellular Part 2 Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Switchboard Cellular Part 2 Audio was developed by the Linguistic Data Consortium (LDC)
and consists of approximately 200 hours of English telephone conversations collected
by LDC in 2000. The Switchboard cellular collection focused primarily on cellular
phone technology of all service types. The goal was to recruit 200 subjects, balanced
by gender, to participate in ten or more five- to six-minute conversations on cellular phones.
The speech data was collected for research, development, and evaluation of automatic
systems for speech-to-text conversion, talker identification, language identification
and speech signal detection purposes. During the study period, LDC collected a total
of 2,020 calls, or 4,040 sides (2,950 cellular), from 419 participants (2,405 sides with
female speakers, 1,635 with male speakers) under varied environmental conditions. *Data* This
release contains speech data files with documentation describing speaker information
(sex, age, education, city and state where raised), call information (date, time,
call duration, Personal Identification Numbers, topic), and audit information (channel
quality, background noise). The documentation also contains reports on clipped files.
Each speech file consists of a 1,024-byte ASCII-formatted Sphere header, followed
by two-channel interleaved mu-law sample data. The mu-law samples represent the actual
digital data transmission from the telephone service provider (MCI), as captured separately
for each side of the telephone conversation by LDC's telephone collection platform.
The header also indicates the caller_pin, callee_pin, topic_id, cellular service/handset
information and speaker demographic information. The data files are not compressed.
Other releases in this series include: Switchboard Cellular Part 1 Audio (LDC2001S13),
Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15), and Switchboard Cellular Part
1 Transcription (LDC2001T14).
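The summary describes each speech file as a 1,024-byte ASCII SPHERE header followed by sample data. The sketch below shows one way such a header could be read in Python, assuming the usual SPHERE layout: a "NIST_1A" line, a header-size line, "name -type value" lines, and a closing "end_head" line. The function name parse_sphere_header is illustrative only; the corpus documentation remains the authority on which fields are actually present.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file.

    `raw` should hold at least the header bytes (1,024 for this corpus).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE header")
    header_size = int(lines[1].strip())

    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # blank line or comment
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string of N characters
            fields[name] = value
    return {"header_size": header_size, "fields": fields}
```

In use, one would read the first 1,024 bytes of a speech file in binary mode and pass them to the function. Fields such as caller_pin and topic_id, mentioned in the summary, would then appear in the returned dictionary if they are stored under those names.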
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633003
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 111-911-583-428-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RT-03 MDE Training Data Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The MDE RT-03 Training Data Speech corpus was produced by the Linguistic Data Consortium (LDC)
as catalog number LDC2004S08, ISBN 1-58563-300-3. This data was originally created
to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program
in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can
take raw Speech-to-Text output and refine it into forms that are of more use to humans
and to downstream automatic processes. The data in this release consists of English
Conversational Telephone Speech (CTS) and Broadcast News (BN) audio data. The corresponding
transcripts and annotations are available as MDE RT-03 Training Data Text and Annotations.
*Data* There are 633 files, totalling approximately 5.39 GB (uncompressed) representing
over 60 hours of recorded speech. There are approximately 20 hours of Broadcast News
and over 40 hours of Conversational Telephone Speech contained in the corpus. The
annotated data was distributed as training data for the RT-03F evaluation cycle.
The CTS data was drawn from the Switchboard-1 Release 2 corpus. The BN speech data
was drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct
sources:
* American Broadcasting Company (ABC) (1998, 2001)
* National Broadcasting Company (NBC) (1998, 2001)
* Public Radio International (PRI) (1998)
* Cable News Network (CNN) (2001)
*Data Format* The audio data in this corpus conforms to the following technical
specifications.
Type  Format  Encoding    Channels  Sample Rate
CTS   WAVE    u-Law       2         8000/sec
BN    WAVE    16-bit PCM  1         16000/sec
Note that the data is in WAVE format. This is the audio
file format that our annotation tool (MDE Tool) supports. Since the annotation data
is best explored with this open-source annotation tool, the WAVE format is our choice
of data format. *Annotations* The transcripts corresponding to this speech have been
annotated for various kinds of metadata. In simple terms, the goal of MDE is the creation
of automatic transcripts that are maximally readable. To this end, LDC has defined
a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers:
filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals,
and editing terms like "sorry" and "I mean." Edit disfluencies are also identified;
the full extent of the disfluency (or string of adjacent disfluencies) and interruption
points are tagged. Annotators further identify SUs (variously called semantic units, sense
units, syntactic units, slash units or sentence units); that is, units within the
discourse that function to express a complete thought or idea on the part of a speaker.
As with disfluency annotation, the goal of SU labeling is to improve transcript readability,
here by creating a transcript in which information is presented in small, structured,
coherent chunks rather than long turns or stories. There are four types of sentence-level
SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator
consistency, the annotation task also identifies a number of sub-sentence SU boundaries
(coordination and clausal SUs). General information about the EARS MDE Annotation
effort, including free annotation tools, annotation guidelines and additional information
can be found at LDC's main EARS MDE Project Page.
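As a rough aid to reading the Data Format table in the summary above, the following Python sketch converts each row into a storage rate. It assumes u-law stores 8 bits per sample (as in G.711) and ignores WAVE header overhead; the names FORMATS, bytes_per_second and megabytes_per_hour are illustrative.

```python
# Rows of the Data Format table, expressed as bits per sample,
# channel count, and samples per second per channel.
FORMATS = {
    "CTS": {"bits": 8, "channels": 2, "rate": 8000},    # u-law: 8 bits per sample
    "BN": {"bits": 16, "channels": 1, "rate": 16000},   # 16-bit linear PCM
}


def bytes_per_second(fmt):
    """Uncompressed audio payload per second of recording."""
    return fmt["bits"] // 8 * fmt["channels"] * fmt["rate"]


def megabytes_per_hour(fmt):
    """Payload per hour of recording, in units of 10**6 bytes."""
    return bytes_per_second(fmt) * 3600 / 1_000_000
```

Under these assumptions a two-channel CTS file carries 16,000 bytes of audio per second and a single-channel BN file carries 32,000.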
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 706-538-229-826-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST Meeting Pilot Corpus Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST Meeting Pilot Corpus Speech consists of approximately 15 hours of English meeting
speech and was collected in the NIST Meeting Data Collection Laboratory for the NIST
Automatic Meeting Recognition Project. The corresponding transcripts are available
as the NIST Meeting Pilot Corpus Transcripts and Metadata, while the video files will
be published later as NIST Meeting Pilot Corpus Video. For more information regarding
the data collection conditions, meeting scenarios, transcripts, speaker information,
recording logs, errata, and other ancillary data for the corpus, please consult the
NIST project website for this corpus. *Data* The data in this corpus consists of 369
SPHERE audio files generated from 19 meetings (comprising about 15 hours of meeting
room data and amounting to about 32 GB) recorded between November 2001 and December
2003. Each meeting was recorded using two wireless "personal" mics attached to each
meeting participant: a close-talking noise-cancelling boom mic and an omni-directional
lapel mic. Each meeting was also recorded using three omni-directional table mics
and a four-channel directional table mic covering 360 degrees (each channel is recorded
in a separate file). Each individual channel was converted from its 48 kHz, 24-bit
linear PCM source format to 16 kHz, 16-bit linear PCM audio in SPHERE-formatted
files.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stanford, Vincent M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tabassi, Elham
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Laprun, Christophe D.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pratz, Nicolas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lard, Jerome
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633089
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 801-946-303-326-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Santa Barbara Corpus of Spoken American English Part III
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Santa Barbara Corpus of Spoken American English Part III was produced by the Linguistic
Data Consortium (LDC) as catalog number LDC2004S10, ISBN 1-58563-308-9. Santa Barbara
Corpus of Spoken American English Part III is based on hundreds of recordings of natural
speech from all over the United States, representing a wide variety of people of different
regional origins, ages, occupations, and ethnic and social backgrounds. It reflects
many ways that people use language in their lives: conversation, gossip, arguments,
on-the-job talk, card games, city council meetings, sales pitches, classroom lectures,
political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected
by: University of California, Santa Barbara Center for the Study of Discourse (Director:
John W. Du Bois (UCSB), Authors: John W. Du Bois and Robert Englebretson. Associate
Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson
(UCSB)). Santa Barbara Corpus of Spoken American English Part III is also part of
the International Corpus of English (ICE) (Charles W. Meyer, Director), representing
the American Component. For software and additional data resources, please refer to
the following sites: Talkbank, International Corpus of English. Part I of the Santa
Barbara Corpus of Spoken American English is available as LDC2000S85. Part II of the
Santa Barbara Corpus of Spoken American English is available as LDC2003S06. *Data*
The audio data consists of 16 WAVE-format speech files, recorded in two-channel PCM
at 22,050 Hz. The speech files total ~6 hours of audio (1.8 GB), representing over 116K words
and over 9K unique words in transcription. The following files are included:
segment.txt            explanation of the information in segment.tbl
segment.tbl            collection information about the recordings
segment_summaries.txt  brief summaries of audio scenarios
speaker.txt            explanation of the information in speaker.tbl
speaker.tbl            speaker ethnographic, demographic information
table.txt              description of file names and informal titles
annotations.txt        list of conventions and prosodic annotations
The transcripts are in the following .trn format:
2.660 2.805 JOANNE: But,
2.805 4.685 so these slides be real interesting.
6.140 6.325 KEN: ... Yeah.
6.325 7.710 I think it'll be real interesting
Personal names, place names, phone numbers, etc., in the transcripts have
been altered to preserve the anonymity of the speakers and their acquaintances and
the audio files have been filtered to make these portions of the recordings unrecognizable.
Pitch information is still recoverable from these filtered portions of the recordings,
but the amplitude levels in these regions have been reduced relative to the original
signal. A separate filter list file (*.flt) associated with each transcript/waveform
file pair is provided to list the beginning and ending times of the filtered regions.
The file sbc040.flt is empty indicating there was no personal information to filter
out. The filtering was done using a digital FIR low-pass filter, with the cut-off
frequency set at 400 Hz. The effect of the filter was gradually faded in and out at
the beginning and end of the regions over a 1,000 sample region, roughly 45 milliseconds,
to avoid abrupt transitions in the resulting waveform. For a complete listing of the
files, please see file.tbl in the docs directory. *Acknowledgements* The completion
and release of this corpus was facilitated by funding extended by the Talkbank project.
Talkbank is an interdisciplinary research project funded by a five-year grant (BCS-998009,
KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the
University of Pennsylvania. Produced at the LDC by Nii Martey.
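The .trn excerpt in the summary suggests that each line holds a start time, an end time, an optional "SPEAKER:" label, and the transcribed text, with the speaker carried over when a line has no label. The Python sketch below parses lines under exactly that reading. It is an assumption drawn from the excerpt, not from the corpus documentation, and the names Segment and parse_trn are illustrative.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    start: float
    end: float
    speaker: Optional[str]
    text: str


# Two leading time stamps, then the rest of the line.
_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(.*)$")
# An upper-case speaker label followed by a colon (assumed convention).
_SPEAKER = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")


def parse_trn(lines):
    """Turn .trn-style lines into Segment records."""
    segments = []
    speaker = None
    for line in lines:
        match = _LINE.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that do not start with two time stamps
        start, end, rest = float(match.group(1)), float(match.group(2)), match.group(3)
        labelled = _SPEAKER.match(rest)
        if labelled:
            speaker, rest = labelled.group(1), labelled.group(2)
        segments.append(Segment(start, end, speaker, rest.strip()))
    return segments
```

Applied to the four excerpt lines, this yields two segments attributed to JOANNE followed by two attributed to KEN.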
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Du Bois, John W.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Englebretson, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633119
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 171-813-937-657-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2002 Rich Transcription Broadcast News and Conversational Telephone Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2002 Rich Transcription Broadcast News and Conversational Telephone Speech was produced
by the Linguistic Data Consortium (LDC) as catalog number LDC2004S11, ISBN 1-58563-311-9.
This corpus contains the test material used in the 2002 Rich Transcription (RT-02)
Evaluation of Broadcast News and Conversational Telephone Speech, administered by
the NIST Speech Group in the Spring of 2002. The RT-02 Meeting Recognition Evaluation
material is available in a separate distribution. For complete up-to-date information,
see the RT-02 Evaluation Website. The RT-02 Evaluation supported two main evaluation
tasks:
* Speech-To-Text (STT) Tasks -- included three processing speeds (1x real time,
  10x real time, and unlimited time) for both the Broadcast News (BN) and Conversational
  Telephone Speech (CTS) domains.
* Metadata Extraction (MDE) Task -- consisted of a
  speaker diarization task for the BN and CTS domains.
*Data* This distribution of the
RT-02 Evaluation Data contains only Broadcast News and Conversational Telephone Speech
data. Meeting data used in the RT-02 Evaluation is not included in this distribution
and is packaged in a separate distribution. All recordings are in English. The BN
data is composed of six approximately 10-minute excerpts from six different broadcasts.
Each waveform is a SPHERE-headered, single-channel, 16-bit PCM file. The broadcasts
were selected from programs from MNB, PRI, NBC, CNN, VOA and ABC, all collected in
1998. The evaluation excerpts were transcribed to the nearest story boundary. The
CTS data is composed of 60 approximately five-minute excerpts from 60 different conversations:
20 from Switchboard-1 data, 20 from Switchboard-2 data, and 20 from Switchboard Cellular-2
data. Evaluation excerpts were transcribed to the nearest turn. Unlike the BN audio
files where the full broadcasts were provided, the CTS audio files contain only the
evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel interleaved
8-bit mu-law file. The reference transcripts are also provided in this corpus. The
official format for STT reference data is STM (files with the extension 'stm'), while
the official format for MDE reference data is RTTM (files with the extension 'rttm').
Files with the extensions 'txt' or 'utf' are the original reference transcripts
before any format conversions, additions of annotations, etc., and are included for
completeness.
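The CTS excerpts are described above as two-channel interleaved 8-bit mu-law data. The Python sketch below shows how such a payload could be split into its two conversation sides and expanded to linear samples using the standard G.711 mu-law expansion. It assumes the bytes passed in are the sample data that follow the SPHERE header, are not otherwise compressed, and alternate between the two channels starting with the first. The function names are illustrative.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF                 # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(payload: bytes):
    """Deinterleave two-channel sample data and decode each side."""
    side_a = [ulaw_to_linear(b) for b in payload[0::2]]
    side_b = [ulaw_to_linear(b) for b in payload[1::2]]
    return side_a, side_b
```

The decoded values fit in 16 bits, so each side can be written out as 16-bit linear PCM if a tool requires it.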
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u zxx d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633127
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 722-712-209-012-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
zxx
- Language code of text/sound track or separate title:
zxx
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TalkBank Ethology Corpus: Field Recordings of Vervet Monkey Calls contains digitized
audio files of field recordings of vervet monkeys (Cercopithecus aethiops) collected
by Robert M. Seyfarth and Dorothy L. Cheney in 1977 and 1978. The original recordings
were made on 1/8 inch reel-to-reel tapes, and were digitized at LDC using the sampling
frequency of 44.1 kHz. This publication contains downsampled versions of the recordings.
From 2001 through 2004, Dr. Seyfarth and one of his students annotated the recordings
for the following fields: start time, end time, type (bout, commentary, etc.), recording
date and time, caller ID, recipient ID, context, call type and remarks. This publication
contains annotation files that can be viewed with the accompanying software or with
spreadsheet software, such as Microsoft Excel. *Data* There are 60 audio files (approximately
5 GB), containing approximately 30 hours of recordings. All of the audio files are
in the Microsoft WAV format with the sampling frequency of 22,050 Hz. There are 60
annotation files containing approximately 1,270
annotations of selected audio files in the distribution. All of the annotation files
are tab-separated table files without a header, and use the UNIX-style newline (\n).
Annotation files for the following audio files
are not included in this publication: V08, V21, V42, V43, V44, V66, Vb1, Vb2, Vb3,
Vb4, Vb5, Vb6, Vb7, Vb8, Vb9, Vc1, Vc2 and Vc3. *Software* The data/software directory
contains a Windows version of the AGTK TableTrans tool for annotation of audio that
was used in creating the annotations. Requirements: PC running Windows XP, Windows
2000 or Windows NT. Installation: The AGTK TableTrans1_2_2.exe file is a self-extracting
installation program. Double-clicking on the file will start the installation program.
Information: Information about the AGTK (the Annotation Graph Toolkit) project is
available at: http://agtk.sf.net/
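Because the annotation files are described as tab-separated tables without a header row, column names have to be supplied by the reader. The Python sketch below does this with the standard csv module. The column order simply follows the order in which the summary lists the fields, and "recording date and time" is treated as a single column; both are assumptions that should be checked against the actual files. FIELDS and read_annotations are illustrative names.

```python
import csv
import io

# Assumed column order, taken from the field list in the summary.
FIELDS = [
    "start_time", "end_time", "type", "recording_datetime",
    "caller_id", "recipient_id", "context", "call_type", "remarks",
]


def read_annotations(stream):
    """Read a headerless tab-separated annotation table into dictionaries."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if not record:
            continue  # skip empty lines
        # Pad short rows so every dictionary has all keys.
        padded = record + [""] * (len(FIELDS) - len(record))
        rows.append(dict(zip(FIELDS, padded)))
    return rows
```

To read a file from disk, one would open it in text mode with newline="" and pass the file object to read_annotations; io.StringIO can stand in for a file when experimenting.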
LANGUAGE NOTE
- Language note:
No linguistic content (vervet monkey calls). Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Seyfarth, Robert M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cheney, Dorothy L.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636029
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 125-164-075-830-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST Speaker Recognition Evaluation Test Set Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST Speaker Recognition Evaluation Test Set Part 2 was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 568 hours of conversational
telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi,
Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as
test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE). The ongoing
series of yearly SRE evaluations conducted by NIST is intended to be of interest
to researchers working on the general problem of text-independent speaker recognition.
To this end the evaluations are designed to be simple, to focus on core technology
issues, to be fully supported and to be accessible to those wishing to participate.
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether
a specified speaker is speaking during a given segment of conversational telephone
speech. The task was divided into 15 distinct and separate tests involving one of
five training conditions and one of four test conditions. Further information about
the test conditions and additional documentation is available at the NIST web site
for the 2006 SRE and within the 2006 SRE Evaluation Plan. LDC previously published
2006 NIST Speaker Recognition Evaluation Training Set and 2006 NIST Speaker Recognition
Evaluation Test Set Part 1. *Data* The speech data in this release was collected by
LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer
project supports the development of robust speaker recognition technology by providing
carefully collected and audited speech from a large pool of speakers recorded simultaneously
across numerous microphones and in different communicative situations and/or in multiple
languages. The data is mostly English speech, but includes some speech in Arabic,
Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu. The telephone
speech segments are multi-channel data collected simultaneously from a number of auxiliary
microphones. The files are organized into four types: two-channel excerpts of approximately
10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations
also of approximately 5 minutes and a two-channel conversation with the usual telephone
speech replaced by auxiliary microphone data in the putative target speaker channel.
The auxiliary microphone conversations are also of approximately five minutes in length.
The speech files are stored as 8-bit u-law speech signals in separate SPHERE files.
In addition to the standard header fields, the SPHERE header for each file contains
some auxiliary information such as the language of the conversation. English language
time-aligned transcripts in .ctm format were produced using an automatic speech recognition
(ASR) system.
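The SPHERE header mentioned above is plain ASCII, so its fields can be inspected without special tools. The sketch below assumes the usual NIST SPHERE layout (a `NIST_1A` magic line, a header-size line, then `name -type value` lines ending with `end_head`); the function name and the demo header are hypothetical, and the name of the field that carries the conversation language is not given here, so it is not shown.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file into a dict."""
    magic, size_line = data[:16].decode("ascii").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(size_line.strip())

    fields = {}
    body = data[16:header_size].decode("ascii", errors="replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        name, type_code, value = line.split(" ", 2)
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of length N
            fields[name] = value
    return fields


# Hypothetical header resembling a two-channel 8-bit u-law telephone recording.
demo = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")
header = parse_sphere_header(demo)
```

Any auxiliary fields present in a real file would appear in the returned dictionary alongside the standard ones.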
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Urdu, Thai, Spanish, Russian, Korean, Hindi, Persian, English,
Mandarin Chinese, Bengali, Standard Arabic, Dari, Iranian Persian, Chinese, and Arabic.
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635871
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 531-416-977-177-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tir
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
geo
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
wuu
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tir
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
nan
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
kat
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
ary
- Language code of text/sound track or separate title:
kxm
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 NIST Speaker Recognition Evaluation Training Set Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008 NIST Speaker Recognition Evaluation Training Set Part 1 was developed by LDC
and NIST (National Institute of Standards and Technology). It contains 640 hours of
multilingual telephone speech and English interview speech along with time-aligned
transcripts and other materials used as training data in the 2008 NIST Speaker Recognition
Evaluation (SRE). SRE is part of an ongoing series of evaluations conducted by NIST.
These evaluations are an important contribution to the direction of research efforts
and the calibration of technical capabilities. They are intended to be of interest
to all researchers working on the general problem of text-independent speaker recognition.
To this end the evaluation is designed to be simple, to focus on core technology issues,
to be fully supported, and to be accessible to those wishing to participate. The 2008
evaluation was distinguished from prior evaluations, in particular those in 2005 and
2006, by including not only conversational telephone speech data but also conversational
speech data of comparable duration recorded over a microphone channel involving an
interview scenario. *Data* The speech data in this release was collected in 2007 by
LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the
International Computer Science Institute (ICSI) at the University of California, Berkeley.
This collection was part of the Mixer 5 project, which was designed to support the
development of robust speaker recognition technology by providing carefully collected
and audited speech from a large pool of speakers recorded simultaneously across numerous
microphones and in different communicative situations and/or in multiple languages.
Mixer participants were native English and bilingual English speakers. The telephone
speech in this corpus is predominately English, but also includes the languages identified
above. All interview segments are in English. Telephone speech represents approximately
565 hours of the data, whereas microphone speech represents the other 75 hours. The
telephone speech segments include excerpts in the range of 8-12 seconds and 5 minutes
from longer original conversations. The interview material includes short conversation
interview segments of approximately 3 minutes from a longer interview session. As
in prior evaluations, intervals of silence were not removed. Also, two separate conversation
channels are provided (to aid systems in echo cancellation, dialog analysis, etc.).
There are approximately six files distributed as part of SRE08 where each file is
a 1024 byte header with no audio. However, these files were not included in the trials
or keys distributed in the SRE08 aggregate corpus. English language transcripts in
.ctm format were produced using an automatic speech recognition (ASR) system.
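The header-only files described above can be spotted by size alone, since a file consisting of a 1024-byte header and no audio is exactly 1024 bytes long. A minimal sketch follows; the function name and the demo file names are hypothetical.

```python
import os
import tempfile

HEADER_BYTES = 1024  # header size stated in the corpus description


def header_only_files(paths):
    """Return the paths whose size equals the header size, i.e. files with no audio."""
    return [p for p in paths if os.path.getsize(p) == HEADER_BYTES]


# Demonstration with two throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.sph")
    normal = os.path.join(tmp, "normal.sph")
    with open(empty, "wb") as f:
        f.write(b" " * HEADER_BYTES)
    with open(normal, "wb") as f:
        f.write(b" " * HEADER_BYTES + b"\xff" * 8000)
    flagged = [os.path.basename(p) for p in header_only_files([empty, normal])]
```

Running such a check before feature extraction avoids errors from tools that expect at least one audio sample.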
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Wu Chinese, Vietnamese, Uzbek, Urdu, Tigrinya, Thai, Tagalog,
Spanish, Russian, Panjabi, Min Nan Chinese, Lao, Korean, Central Khmer, Georgian,
Japanese, Italian, Hindi, Persian, English, Mandarin Chinese, Bengali, Egyptian Arabic,
Moroccan Arabic, Northern Khmer, Dari, Iranian Persian, Chinese, and Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 771-157-694-578-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2005 Spring NIST Rich Transcription (RT-05S) Conference Meeting Evaluation Set was
developed by LDC and NIST (National Institute of Standards and Technology). It contains
approximately 78 hours of English meeting speech, reference transcripts and other
material used in the RT Spring 2005 evaluation. Rich Transcription (RT) is broadly
defined as a fusion of speech-to-text (STT) technology and metadata extraction technologies
providing the bases for the generation of more usable transcriptions of human-human
speech in meetings. LDC has also released 2004 Spring NIST Rich Transcription (RT-04S)
Development Data LDC2007S11 and 2004 Spring NIST Rich Transcription (RT-04S) Evaluation
Data LDC2007S12. RT-05S included the following tasks in the meeting domain:
* Speech-To-Text (STT) - convert spoken words into streams of text
* Speaker Diarization (SPKR) - find the segments of time within a meeting in which
each meeting participant is talking
* Speech Activity Detection (SAD) - detect when someone in a meeting space is talking
Further information about the evaluation is available on the RT-05Spring Evaluation
Website. Please note the lecture meeting data is not included in this release. *Data
Description* The data in this release consists of portions of meeting speech collected
between 2001 and 2005 by the IDIAP Research Institute's Augmented Multi-Party Interaction
project (AMI), Martigny, Switzerland; the International Computer Science Institute (ICSI)
at the University of California, Berkeley; Interactive Systems Laboratories (ISL) at Carnegie
Mellon University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic Institute and
State University (VT), Blacksburg, VA. Each meeting excerpt contains a head-mic recording
for each subject and one or more distant microphone recordings. Reference transcripts
for the evaluation excerpts were prepared by LDC according to its Meeting Recording
Careful Transcription Guidelines. Those specifications are designed to provide an
accurate, verbatim (word-for-word) transcription, time-aligned with the audio file
and including the identification of additional audio and speech signals with special
mark-up.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Discourse analysis
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Discourse analysis
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633135
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 259-501-047-379-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher English Training Speech Part 1 Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher English Training Speech Part 1 Speech represents the first half of a collection
of conversational telephone speech (CTS) that was created at the LDC during 2003.
It contains 5,850 audio files, each one containing a full conversation of up to 10
minutes. Additional information regarding the speakers involved and types of telephones
used can be found in the companion text corpus of transcripts, Fisher English Training
Speech Part 1, Transcripts (LDC2004T19). The Fisher telephone conversation collection
protocol was created at LDC to address a critical need of developers trying to build
robust automatic speech recognition (ASR) systems. Previous collection protocols,
such as CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted
for ASR research but were in fact developed for language and speaker identification
respectively. Although the CALLHOME protocol and corpora were developed to support
ASR technology, they feature small numbers of speakers making telephone calls of relatively
long duration with narrow vocabulary across the collection. CALLHOME conversations
are challengingly natural and intimate. Under the Fisher protocol, a very large number
of participants each make a few calls of short duration speaking to other participants,
whom they typically do not know, about assigned topics. This maximizes inter-speaker
variation and vocabulary breadth while also increasing formality. Previous protocols
such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive
the collection. Fisher is unique in being platform driven rather than participant
driven. Participants who wish to initiate a call may do so; however, the collection
platform initiates the majority of calls. Participants need only answer their phones
at the times they specified when registering for the study. To encourage a broad range
of vocabulary, Fisher participants are asked to speak on an assigned topic which is
selected at random from a list, which changes every 24 hours and which is assigned
to all subjects paired on that day. Some topics are inherited or refined from previous
Switchboard studies while others were developed specifically for the Fisher protocol.
*Data* The individual audio files are presented in NIST SPHERE format, and contain
two-channel mu-law sample data; "shorten" compression has been applied to all files.
Data collection and transcription were sponsored by DARPA and the U.S. Department
of Defense, as part of the EARS project for research and development in automatic
speech recognition.
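Once a file has been decompressed from "shorten", its mu-law bytes can be expanded to linear samples. The sketch below uses the standard G.711 mu-law expansion and assumes the two channels are stored as interleaved samples; the function names are hypothetical and this is not the LDC's own tooling.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(samples):
    """Split interleaved two-channel samples into (channel A, channel B)."""
    return samples[0::2], samples[1::2]


# Four codes: the two zero codes, then the largest positive and negative values.
decoded = [ulaw_to_linear(b) for b in bytes([0xFF, 0x7F, 0x80, 0x00])]
channel_a, channel_b = split_channels(decoded)
```

The decoded values fit in 16 bits, so they can be written out as ordinary 16-bit PCM for tools that do not read mu-law.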
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kimball, Owen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632813
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 577-603-476-733-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Czech Broadcast News Transcripts contains the transcripts corresponding to the Czech
broadcast news audio published as Czech Broadcast News Speech. The audio recordings
were collected from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1,
Cesky rozhlas 2 Praha - CRo2, Cesky rozhlas 3 Vltava - CRo3) and two TV channels (Ceska
televize - CTV and Prima TV - Prima). The audio was recorded between February 1 and
April 22, 2000, at the Department of Cybernetics, University of West Bohemia in Pilsen.
The corpus was created to support the development of large vocabulary speaker independent
speech recognition systems for Czech. *Data* There are 286 transcripts, corresponding
to the 286 audio files (approximately 50 hours of broadcast news). The transcripts
contain approximately 196K words and 27K unique words. The news does not contain
weather forecasts, sports news, or traffic announcements. The stations, channels,
number of files and number of hours are listed below:
Radio Source  Files  Hours
CRo1          138    30.8
CRo2          90     7.8
CRo3          14     2
TV Source     Files  Hours
CTV           22     5.1
Prima         22     4.2
The transcripts
were created by native Czech speakers working at the Department of Cybernetics, University
of West Bohemia in Pilsen, under the direction of Vlasta Radova. The transcription
was done using software provided by the LDC (Transcriber 1.4.1). Parts of the
audio recordings that do not contain speech, or where the signal was disrupted, were
not transcribed. As a consequence, the corpus contains about 23 hours of transcribed
speech. The transcriptions are provided in both the ISO-8859-2 and Windows-1250 character
sets. *Sponsorship* The completion
of this corpus was facilitated by funding provided by the Ministry of Education of
the Czech Republic (Grants No. MSM235200004 and LN00A063) and by the National Science
Foundation (NSF) project no. IIS-9820687 entitled "1999 Language Engineering Workshop
for Students and Professionals: Integrating Research and Education (WS99)" under the
agreement no. 8004-48231 between the Johns Hopkins University, Baltimore, Maryland,
and the University of West Bohemia in Pilsen, Czech Republic.
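Since the transcripts are supplied in two character sets, a reader needs to decode each copy with the matching codec. The short Python sketch below illustrates why: the two encodings agree on many Czech letters but not all. The sample word is illustrative only and is not taken from the corpus files.

```python
word = "Radiožurnál"  # illustrative Czech word with diacritics

iso_bytes = word.encode("iso-8859-2")
win_bytes = word.encode("cp1250")  # Python's codec name for Windows-1250

# The byte sequences differ because "ž" has a different code in each set.
differs = iso_bytes != win_bytes

# Decoding with the matching codec recovers the text.
roundtrip_ok = iso_bytes.decode("iso-8859-2") == word == win_bytes.decode("cp1250")

# Decoding with the wrong codec silently yields different characters.
mismatch = win_bytes.decode("iso-8859-2") != word
```

Because a wrong codec raises no error here, it is safer to pick the codec from the directory or file naming used in the release than to guess from the content.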
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Radova, Vlasta
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muller, Ludek
ADDED ENTRY--PERSONAL NAME
- Personal name:
Byrne, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, J.V.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ircing, Pavel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matousek, Jindrich
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632821
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 530-268-392-589-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 2 v 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC) under
catalog number LDC2004T02 and ISBN 1-58563-282-1. This publication is the second part of a
1,000,000-word Arabic Treebank corpus, designed to support language research
and development of language technology for Modern Standard Arabic. Part one was released
in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted from Agence
France-Presse stories. The current Arabic Treebank: Part 2 v 2.0 corpus consists of
stories from Al-Hayat distributed by Ummah. *Data* This corpus includes 501 stories
from the Ummah Arabic News Text. There are a total of 144,199 words (counting non-Arabic
tokens such as numbers and punctuation) in the 501 files - one story per file. New
features of annotation include complete vocalization (including case endings), lemma
IDs, and more specific POS tags for verbs and particles. The corpus contains 125,698
Arabic-only word tokens (prior to the separation of clitics), of which 124,740 (99.24%)
were provided with an acceptable morphological analysis and POS tag by the morphological
parser, and 958 (0.76%) were items that the morphological parser failed to analyze
correctly.
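The coverage figures quoted above can be re-derived from the raw counts. The following is a minimal sketch that uses only the numbers given in this summary; the variable names are illustrative, and the non-Arabic token count is an inference that assumes both totals were produced with the same tokenization.

```python
# Counts quoted in the summary above.
total_tokens = 144_199    # all tokens, including numbers and punctuation
arabic_tokens = 125_698   # Arabic-only word tokens (before clitic separation)
analyzed = 124_740        # tokens given an acceptable analysis and POS tag
failed = 958              # tokens the morphological parser failed on

# The two outcome counts should partition the Arabic-only tokens.
assert analyzed + failed == arabic_tokens

share_analyzed = round(100 * analyzed / arabic_tokens, 2)
share_failed = round(100 * failed / arabic_tokens, 2)

# Inferred, not stated in the record: tokens that are not Arabic words.
non_arabic_tokens = total_tokens - arabic_tokens

print(f"analyzed: {share_analyzed}%  failed: {share_failed}%")
print(f"non-Arabic tokens (inferred): {non_arabic_tokens}")
```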
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632848
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 338-479-223-657-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Morphologically Annotated Korean Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC)
under catalog number LDC2004T03 and ISBN 1-58563-284-8. This is a collection of Korean text
with annotated morphological analysis and part-of-speech tags. The source text was
extracted from the Korean Newswire corpus. The newswire corpus is a collection of
Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion
included in this release consists of a small number of hand-picked articles. The corpus
is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted
under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was
tokenized and then automatically analyzed using Klex. Since there can be multiple
possible morphological analyses, the output was fed through a statistical ranking
system in order to select the best possible analysis for the word in the text environment.
The part-of-speech tagged result was then manually corrected by Seung-yun Yang and
Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.
*Data* The data consists of a single file, totalling approximately 880 KB in uncompressed
form. The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in
total. The text file is in ksc-5601 encoding. Characters in Hangul (Korean alphabet)
can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding
in common web browsers such as Netscape or Internet Explorer. The data is formatted
as follows: there is one head word per line, and the word and its morphologically analyzed
output are separated by a tab. Each morpheme is followed by "/" and its part-of-speech tag;
morphemes are separated by "+". ^EOS is a special symbol denoting the end of a sentence.
Morphologically analyzed and part-of-speech tagged data can be useful for training
statistical morphological analyzers and part-of-speech taggers, and for evaluating
existing morphological analyzers and part-of-speech taggers. The morphologically
tagged output is compatible with Klex: Finite-State Lexical Transducer for Korean.
It also conforms to the Korean Treebank POS annotation standards.
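The line format described above (head word, tab, then "+"-separated morphemes each written as morpheme/tag, with ^EOS closing a sentence) can be read with a few lines of Python. This is a sketch under stated assumptions, not the corpus's own tooling: the sample lines and their tags are invented for illustration, the function names are hypothetical, and a literal "+" occurring as a morpheme would need extra handling that is not shown.

```python
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]          # (morpheme, part-of-speech tag)
Word = Tuple[str, List[Morpheme]]   # (head word, its analysis)

EOS = "^EOS"  # end-of-sentence symbol named in the record

# Invented sample in the described layout; not taken from the corpus.
SAMPLE = "학교에\t학교/NNC+에/PAD\n갔다\t가/VV+았/EPF+다/EFN\n.\t./SFN\n^EOS\n"


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'morph/TAG+morph/TAG' into (morph, TAG) pairs."""
    morphemes: List[Morpheme] = []
    for chunk in analysis.split("+"):
        if "/" in chunk:
            # Split on the last "/" so a "/" inside the morpheme survives.
            form, tag = chunk.rsplit("/", 1)
        else:
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> Iterator[List[Word]]:
    """Group tab-separated word lines into sentences, ending at ^EOS."""
    sentence: List[Word] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(EOS):
            if sentence:
                yield sentence
            sentence = []
            continue
        word, _, analysis = line.partition("\t")
        sentence.append((word, parse_analysis(analysis)))
    if sentence:  # tolerate a final sentence with no closing ^EOS
        yield sentence


sentences = list(read_sentences(SAMPLE.splitlines()))
for word, analysis in sentences[0]:
    print(word, analysis)
```

To read the distributed file instead of the sample, the lines could come from a file opened with encoding="euc_kr" (Python lists ksc5601 as an alias of that codec); the file name depends on the release and is not given in this record.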
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633941
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S39
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 972-485-703-759-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Names Release 1.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S39
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
A common problem in training and developing speech recognition systems is scarcity
of data, especially for particular phonemic contexts. The Center for Spoken Language Understanding
is attempting to address this problem with the Names Corpus. The Names Corpus is a
collection of name utterances, both first and last names, from several thousand different
speakers over the telephone. Name utterances are "spontaneous" in that the subject
is not reading from a word list. Another area of active research is the development
of name recognition systems. The Names Corpus is a useful resource for addressing
this problem. The utterances in this corpus were taken from many other telephone speech
data collections that have been completed at the CSLU. In most data collections, the
callers were asked to leave their name at some point. Also, the callers would occasionally
leave their name in the midst of another utterance. The names in these situations
were extracted out of the host utterance and added to the Names Corpus. Each file
in the Names Corpus has an orthographic transcription following the CSLU Labeling
Conventions. Also, to take advantage of the phonemic variability, many of the utterances
have been phonetically transcribed. The selection of files to phonetically transcribe
was constrained by a process that selected files that were suspected to contain phonetic
contexts that had not yet been transcribed. Release 1.3 of this corpus contains 24,245
files. All of these have been phonetically labeled. Approximately 40% of the bigram
phonemic contexts possible, without regard to language constraints, are represented.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Names, Personal
- Form subdivision:
Databases.
- General subdivision:
Pronunciation
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oshika, Beatrice
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S39
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632864
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 295-380-961-299-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ICSI Meeting Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ICSI Meeting Transcripts was produced by the Linguistic Data Consortium (LDC) under catalog
number LDC2004T04 and ISBN 1-58563-286-4. The ICSI Meeting corpus is a collection
of 75 meetings collected at the International Computer Science Institute in Berkeley
during the years 2000-2002. The meetings included are "natural" meetings in the sense
that they would have occurred anyway: they are generally regular weekly meetings of
various ICSI working teams, including the team working on the ICSI Meeting Project.
In recording meetings of this type, we hoped to capture meeting dynamics and speaking
styles that are as natural as possible given that speakers are wearing close-talking
microphones and are fully cognizant of the recording process. The speech files range
in length from 17 to 103 minutes, but generally run just under an hour each. The speech
files are available as ICSI Meeting Speech. *Data* This corpus consists of 75 word-level
transcripts (one transcript file per meeting), time-synchronized to digitized audio
recordings. There are approximately 795K words and 13K unique words in the transcripts.
The meetings were recorded with close-talking and far-field microphones. The transcripts
were based mostly on the close-talking microphones, either separately or blended together
in a so-called "mixed" channel. The focus of the transcripts was on capturing the
flow of audible events, especially the words which were spoken, and who spoke them.
Transcripts were prepared by means of the "Channeltrans" interface. Channeltrans is
an extension of the "Transcriber" interface. There are a total of 53 unique speakers
in the corpus. Meetings involved anywhere from three to 10 participants, averaging
six. The corpus contains a significant proportion of non-native English speakers,
varying in fluency from nearly-native to challenging-to-transcribe. *Sponsorship*
The collection and preparation of this corpus was made possible in large part through
funding from DARPA, both through the Communicator project and through a ROAR "seedling,"
the Swiss IM2 project (National Centre of Competence in Research, sponsored by the
Swiss National Science Foundation), and a supplementary award from IBM.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computer science.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Janin, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Edwards, Jane
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Dan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gelbart, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, Nelson
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peskin, Barbara
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pfau, Thilo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stolcke, Andreas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wooters, Chuck
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632872
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 191-685-030-898-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 4.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Treebank 4.0 was produced by the Linguistic Data Consortium (LDC) under catalog number
LDC2004T05 and ISBN 1-58563-287-2. The Penn Chinese Treebank is an ongoing project
that started in the summer of 1998. The goal of the project is to create a 500,000-word
corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published
in 2000. It was later corrected and released in 2001 as Chinese Treebank 2.0. More
information about the project is available on the Penn Chinese Treebank website. The
content used in this corpus comes from the following newswire sources: 698 articles
from Xinhua (1994-1998); 55 articles from the Information Services Department of HKSAR
(1997); and 80 articles from Sinorama magazine, Taiwan (1996-1998 and 2000-2001). *Data*
Chinese Treebank 4.0 contains 404,156 words, 664,633 Hanzi, 15,162 sentences, and 838 data files. All
files are GB encoded. The format of Chinese Treebank 4.0 is the same as that of the Penn English
Treebank. All files have been annotated at least twice. The first pass was done by
one annotator, and the resulting files were checked by a second annotator (second
pass). The corpus also provides seven files intended to serve as the gold standard
annotation. The corpus provides four versions of files: bracketed, raw, segmented
and postagged. The raw, segmented and postagged versions are generated from the bracketed
version and so do not reflect the previous annotation stages.
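Because the raw, segmented and postagged versions are generated from the bracketed version, the derivation can be illustrated with a small reader for Penn Treebank style brackets. This is a sketch, not the project's conversion script: the sample tree is invented, the function names are hypothetical, the word_TAG output layout is an assumption, and tokens that themselves contain parentheses are not handled.

```python
from typing import List, Tuple

# Invented example in Penn Treebank bracket notation; not taken from the corpus.
SAMPLE = "(IP (NP-SBJ (-NONE- *pro*)) (VP (VV 发展) (NP-OBJ (NN 经济))))"


def tokenize(text: str) -> List[str]:
    """Separate parentheses from labels and words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[list, int]:
    """Parse one bracketed node starting at tokens[pos]; return (node, next position)."""
    if tokens[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    node: list = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1


def tagged_leaves(node: list) -> List[Tuple[str, str]]:
    """Collect (word, tag) pairs, skipping empty categories tagged -NONE-."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        tag, word = node
        return [] if tag == "-NONE-" else [(word, tag)]
    pairs: List[Tuple[str, str]] = []
    for child in node:
        if isinstance(child, list):
            pairs.extend(tagged_leaves(child))
    return pairs


tree, _ = parse(tokenize(SAMPLE))
pairs = tagged_leaves(tree)

segmented = " ".join(word for word, _ in pairs)           # words separated by spaces
postagged = " ".join(f"{word}_{tag}" for word, tag in pairs)  # assumed word_TAG layout
raw = "".join(word for word, _ in pairs)                   # unsegmented text

print(segmented)
print(postagged)
print(raw)
```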
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Tsan-Kuang
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632899
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 026-006-085-012-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Chinese (MTC) Part 3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Chinese (MTC) Part 3 was produced by the Linguistic Data Consortium
(LDC) under catalog number LDC2004T07 and ISBN 1-58563-289-9. To support the development
of automatic means for evaluating translation quality, the LDC was sponsored to solicit
four sets of human translations for a single set of Mandarin Chinese source materials.
Two similar corpora, Multiple-Translation Chinese Corpus, and Multiple-Translation
Chinese Corpus Part 2 were published in 2002 and 2003. The 2002 corpus (Part 1), 2003
corpus (Part 2), and the present corpus used Chinese news articles from multiple sources
and provide human translations for them. However, Part 1 also offers translations
produced by various commercial off-the-shelf (COTS) systems. In addition to human
and COTS translations, Part 2 also offers translations from a TIDES research system,
and provides human assessment for some of the automatic translations. *Data* Two sources
of journalistic Mandarin Chinese text were selected to provide the Chinese material:
AFP News Service (50 news stories) and Xinhua News Service (50 news stories), for a total
of 100 stories. The data was drawn from the May and June 2002 collection of AFP and Xinhua
news. The story selection from the two newswire collections was controlled by story
length: all selected stories contain between about 230 and 564 Chinese characters.
The overall count of Chinese characters by source is shown in the following table:
Source  Chinese characters
AFP     22,135
Xinhua  20,321
Total   42,456
For the Chinese data, there
are approximately 21K words, while for the English translation, there are approximately
100K words in total, and 12K unique words. The four best translation teams were chosen
from the 11 teams which had participated in the translation of Multiple Translation
Chinese Corpus Part 1 (LDC2002T01) and Part 2 (LDC2003T17) to take part in the project.
In accordance with the guidelines, each translation team was asked to return the first
ten Xinhua stories for quality checking. This was to ensure that each translation
team had indeed understood and was following the guidelines, and the translation quality
was acceptable. The LDC returned translations to a team whenever deviations from the
guidelines or quality issues were detected. Subsequent translation
submissions were continuously monitored for conformance and quality. Once the full
set of translations was complete, a final pass of reformatting and validation was
carried out, to assure alignability of segments, and to convert the translated texts
into SGML format. Each translation team was also asked to fill out and return a questionnaire
to describe their procedures and professional background. (Characters in Chinese can be
displayed by selecting Chinese encoding in your browser.)
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistic data analysis
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632902
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 619-530-254-208-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hong Kong Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hong Kong Parallel Text was developed by the Linguistic Data Consortium (LDC) and
contains data from three sub-corpora, namely Hong Kong Hansards Parallel Text, Hong
Kong Laws Parallel Text and Hong Kong News Parallel Text. Hong Kong Hansards Parallel
Text contains excerpts from the Official Record of Proceedings of the Legislative
Council of the Hong Kong Special Administrative Region (HKSAR). Hong Kong Laws Parallel
Text contains law codes acquired from the Department of Justice of the HKSAR. Hong
Kong News Parallel Text contains press releases from the Information Services Department
of the HKSAR. Hong Kong Hansards Parallel Text, Hong Kong Laws Parallel Text and Hong
Kong News Parallel Text were published in 2000. The 2000 versions of Hong Kong Hansards
Parallel Text and Hong Kong News Parallel Text are aligned at the document level,
while the 2004 versions are aligned at the sentence level. The 2000 and 2004 versions
of Hong Kong News Parallel Text were aligned using different sentence alignment algorithms.
As a result, the 2004 version has better sentence alignment and it also has slightly
more data than the 2000 version. Chinese text is presented in the traditional script
and encoded as BIG5. *Data* *Hong Kong Hansards* Hong Kong Hansards contains excerpts
from the Official Record of Proceedings (hansards) of the Legislative Council of the
HKSAR from October 1985 to April 2003. LDC downloaded the hansards, which were in
pdf format, from the official website of HKSAR. A total of 1,428 files (714 in Chinese,
714 in English) were downloaded. One-to-one correspondence between the English hansards
and the Chinese hansards is indicated by the file names. LDC converted the pdf files
into plain text files using automatic conversion software and segmented the files
at sentence boundaries. Efforts were made to remove tables from all files.
*Hong Kong Laws* Hong Kong Laws contains statute laws of Hong Kong, downloaded in 2000 from
the Bilingual Laws Information System (BLIS, http://www.justice.gov.hk/), a searchable electronic
database of the statute laws of Hong Kong, established and updated by the Department
of Justice of the HKSAR. The original BLIS database contains statute laws
of Hong Kong in English and Chinese, constitutional instruments, national laws and
other relevant instruments, collections of terms and expressions used in the laws
of Hong Kong and subject indices of Ordinances. This corpus contains only statute
laws of Hong Kong in English and Chinese, constitutional instruments, national laws
and other relevant instruments published up to the year 2000. The original files were
in HTML format, and document-level alignment was indicated by file names. LDC converted
the HTML files into plain text files using automatic conversion software, and segmented
the files at sentence boundaries. Efforts were made to remove tables from all files.
*Hong Kong News* Hong Kong News contains press releases from July 1997 to October 2003
from the government of HKSAR. The HKSAR publishes press releases in both Chinese and
English on a daily basis. Most press releases are available in both languages; some
were translated from English to Chinese, others from Chinese to English.
The original files were in HTML format. LDC converted the HTML files into plain text
files using automatic conversion software. Efforts were made to remove tables from
all files. The original files do not indicate document-level alignment in any way.
Document-level alignment was performed at LDC using an automatic document aligner.
The document-aligned files were then segmented at sentence boundaries. Sentence alignment
was performed on all data using Champollion, a parallel text sentence alignment tool
developed at LDC. See http://champollion.sourceforge.net for more information about
Champollion.
*Final Data Format and Validation* For the Chinese data, there are approximately
49M words; for the English translation, there are approximately 59M words in
total and 466K unique words. The following table shows the number of documents, paragraphs,
segments, words and characters for each source.
Source | Documents | Paragraphs (English/Chinese) | Segments (English/Chinese) | English Words | Chinese Characters
Hong Kong Hansards | 714 | 642,008/632,173 | 1,688,278/1,414,573 | 36,140,737 | 56,618,181
Hong Kong Laws | 42,255 | 423,192/462,283 | 451,884/491,719 | 8,396,243 | 14,868,621
Hong Kong News | 44,621 | 605,183/603,118 | 811,638/775,019 | 14,798,671 | 26,677,514
Total | 87,590 | 1,670,383/1,697,574 | 2,951,800/2,681,311 | 59,335,651 | 98,164,316
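The totals row can be cross-checked against the per-source rows. The following is a minimal sketch that copies the document, English word and Chinese character counts from the table into a Python dictionary and sums each column; the variable names are illustrative and are not part of the corpus.

```python
# Per-source counts copied from the table above:
# (documents, English words, Chinese characters)
counts = {
    "Hong Kong Hansards": (714, 36_140_737, 56_618_181),
    "Hong Kong Laws": (42_255, 8_396_243, 14_868_621),
    "Hong Kong News": (44_621, 14_798_671, 26_677_514),
}

# Totals row as published in the table.
published_total = (87_590, 59_335_651, 98_164_316)

# Sum each column across the three sources.
computed_total = tuple(sum(column) for column in zip(*counts.values()))

# True when the per-source rows add up to the published totals.
totals_match = computed_total == published_total
print(totals_match)
```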
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632929
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 685-740-491-198-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TIDES Extraction (ACE) 2003 Multilingual Training Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by Linguistic
Data Consortium (LDC), catalog number LDC2004T09 and ISBN 1-58563-292-9. This corpus
was created and previously distributed by Linguistic Data Consortium as an e-corpus
(catalog number LDC2003E18) to support the September 2003 TIDES Extraction (ACE) program
evaluation. For information regarding the ACE program and ACE technology evaluations
administered by the National Institute of Standards and Technology, please visit the
NIST website. For more information about ACE annotation and ongoing ACE corpus development,
including annotation guidelines, task definitions, annotation tools and other project
documentation, please visit LDC's ACE Project page. The source material for this corpus
consists of broadcast and newswire data drawn from October 2000 through the end of
December 2000. The sources are listed below.
Newswire:
* Arabic
  * Agence France Presse (AFA)
  * Al Hayat (ALH)
  * An-Nahar (ANN)
* Chinese
  * Xinhua Newswire (XIN)
  * Zaobao (ZBN)
* English
  * New York Times Newswire Service (NYT)
  * Associated Press Worldstream Service (APW)
Broadcast News:
* Arabic
  * Voice of America, Arabic news programs (VAR)
  * Nile TV (NTV)
* Chinese
  * China National Radio (CNR)
  * China Television System (CTS)
  * Voice of America, Chinese news programs (VOM)
  * China TV Program Agency (CTV)
  * China Broadcasting System (CBS)
* English
  * Cable News Network, "Headline News" (CNN)
  * American Broadcasting Co., "World News Tonight" (ABC)
  * Public Radio International, "The World" (PRI)
  * Voice of America, English news programs (VOA)
  * MSNBC, "The News With Brian Williams" (MNB)
  * National Broadcasting Company, "Nightly News" (NBC)
*Data* Annotations for this corpus were produced by Linguistic Data Consortium to support
the following tasks, broken down by language:
* Arabic
  * Entity Detection and Tracking (EDT)
* Chinese
  * Entity Detection and Tracking (EDT)
  * Relation Detection and Characterization (RDC)
* English
  * Entity Detection and Tracking (EDT)
  * Relation Detection and Characterization (RDC)
This publication includes both the source data files in .sgm format and the
annotation files in ACE Pilot Format (APF), as well as the ACE DTD and supporting
documentation. The data files for each language are divided by source type (bnews,
nwire). For Chinese, the annotation files (.apf.xml) are encoded in UTF8. We have
included source files (.sgm) in both GB and UTF8 encoding. The following tables outline
the word and file counts by language and source.
Arabic
Source | Words | Files
AFA | 11154 | 66
ALH | 7437 | 20
ANN | 7734 | 20
VAR | 8360 | 57
NTV | 7512 | 43
Total | 42197 | 206
Chinese
Source | Characters | Files
XIN | 28157 | 57
ZBN | 25591 | 42
CNR | 4758 | 21
CTS | 7160 | 22
VOM | 18160 | 42
CTV | 6017 | 18
CBS | 8130 | 19
Total | 97973 | 221
English
Source | Words | Files
NYT | 18983 | 24
APW | 38222 | 81
CNN | 5706 | 54
ABC | 4453 | 15
PRI | 9785 | 27
VOA | 4203 | 28
MNB | 4356 | 8
NBC | 4976 | 15
Total | 90684 | 252
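Because the Chinese source files are supplied in both GB and UTF8 encodings, a reader has to pick the codec that matches the copy being opened. The sketch below is only an illustration: it assumes that "GB" means GB2312 (decoded here with Python's `gb18030` codec, which is a superset), and the `decode_source` helper and the sample string are hypothetical stand-ins, not material from the corpus.

```python
# Map the encoding labels used in this record to Python codec names.
# Assumption: "GB" denotes GB2312; gb18030 is a superset that also decodes it.
CODECS = {"GB": "gb18030", "UTF8": "utf-8"}


def decode_source(raw: bytes, label: str) -> str:
    """Decode the raw bytes of a .sgm source file given its encoding label."""
    return raw.decode(CODECS[label])


# Made-up stand-in for the contents of a source file.
sample = "<DOC> 新华社 </DOC>"
gb_bytes = sample.encode("gb2312")
utf8_bytes = sample.encode("utf-8")

# Both copies should decode to the same text.
same_text = decode_source(gb_bytes, "GB") == decode_source(utf8_bytes, "UTF8")
print(same_text)
```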
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mitchell, Alexis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Davis, J.K.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grishman, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brunstein, Ada
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ferro, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632953
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 751-401-034-298-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ISL Meeting Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ISL Meeting Transcripts Part 1 was produced by Linguistic Data Consortium (LDC), catalog
number LDC2004T10 and ISBN 1-58563-295-3. The ISL Meeting Corpus Part 1 is a first
subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected
at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh,
PA during the years 2000-2001. The recorded meetings were either natural meetings
where participants needed to meet in the real world, or artificial meetings, which
were designed explicitly for the purposes of data collection but still had real topics
and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes
and averages 34 minutes. The audio files are available as ISL Meeting Speech Part
1. *Data* This corpus consists of 19 word-level transcripts of 18 meetings (one transcription
file per meeting, except meeting m039, which has two parts, m039a and m039b), time-synchronized
to digitized audio recordings. There are approximately 116,200 word tokens and 5,850
unique word types in the transcripts. The meetings were recorded with lapel microphones.
The transcriptions were based on the lapel microphone recordings. The focus of the
transcriptions was on capturing the flow of audible events, especially the words which
were spoken, and who spoke them. The transcriptions contain additional annotations
for spontaneous speech events and disfluencies. Transcriptions were prepared by means
of the TransEdit transcription application. This application was developed for the
transcription of multi-channel recordings and displays a synchronized multi-track
view for all channels of a meeting, with listening and segmentation functions for each
single channel separately. There are
a total of 31 unique speakers in the corpus. Meetings involved anywhere from three
to nine participants, averaging five. The corpus contains a significant proportion
of non-native English speakers, varying in fluency. *Sponsorship* The collection and
preparation of this corpus was made possible in large part through funding from DARPA,
both through the GENOA project and through ROAR.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Meetings
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech synthesis
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Burger, Susanne
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacLaren, Victoria
ADDED ENTRY--PERSONAL NAME
- Personal name:
Waibel, Alex
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632988
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 026-183-445-670-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 3 v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 3 v 1.0 was produced by Linguistic Data Consortium (LDC), catalog
number LDC2004T11 and ISBN 1-58563-298-8. This publication is the third part of a
1,000,000-word Arabic Treebank corpus designed to support language research
and development of language technology for Modern Standard Arabic. Part one was
released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted
from Agence France Presse stories. Part two was released in 2004 as Arabic Treebank:
Part 2 v 2.0, with source data extracted from Al-Hayat distributed by Ummah.
The current Arabic Treebank: Part 3 v 1.0 corpus consists of stories from An Nahar
News Agency. *Data* This corpus includes 600 stories from the An Nahar News Text.
There are a total of 340,281 words (counting non-Arabic tokens such as numbers and
punctuation) in the 600 files - one story per file. New features of annotation include
complete vocalization (including case endings), lemma IDs, and more specific POS tags
for verbs and particles. The corpus contains 293,035 Arabic-only word tokens (prior
to the separation of clitics), of which 290,842 (99.25%) were provided with an acceptable
morphological analysis and POS tag by the morphological parser, and 2,193 (0.75%)
were items that the morphological parser failed to analyze correctly.
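The coverage percentages quoted above follow directly from the token counts. As a small worked check (variable names are illustrative only):

```python
total_tokens = 293_035  # Arabic-only word tokens, before clitic separation
analyzed = 290_842      # tokens given an acceptable analysis and POS tag
failed = 2_193          # tokens the morphological parser could not analyze

# The two groups partition the token set.
counts_consistent = analyzed + failed == total_tokens

# Percentages rounded to two decimals, as in the summary.
pct_analyzed = round(100 * analyzed / total_tokens, 2)
pct_failed = round(100 * failed / total_tokens, 2)
print(counts_consistent, pct_analyzed, pct_failed)
```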
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633011
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 754-359-961-593-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RT-03 MDE Training Data Text and Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
RT-03 MDE Training Data Text and Annotations was produced by Linguistic Data
Consortium (LDC), catalog number LDC2004T12 and ISBN 1-58563-301-1. This data was
originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text)
Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology
that can take raw Speech-to-Text output and refine it into forms that are of more
use to humans and to downstream automatic processes. The data in this release consists
of English Conversational Telephone Speech (CTS) and Broadcast News (BN) transcripts
and annotations. The corresponding speech data is available as RT-03 MDE Training
Data Speech. *Data* There are 633 files, totalling approximately 747 MB, with a total
of 764,978 tokens. The transcripts and annotations cover approximately 20 hours of
Broadcast News and over 40 hours of Conversational Telephone Speech data. The annotated
data was originally developed to support the DARPA EARS Metadata Extraction (MDE)
Program, and was distributed as training data for the RT-03F evaluation cycle. The
CTS data was drawn from the Switchboard-1 Release 2 corpus. The BN speech data was
drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct
sources:
* American Broadcasting Company (ABC) (1998, 2001)
* National Broadcasting Company (NBC) (1998, 2001)
* Public Radio International (PRI) (1998)
* Cable News Network (CNN) (2001)
*Annotations* The transcripts within this corpus have been annotated for various
kinds of metadata. The goal of MDE is to enable technology that can take raw Speech-To-Text
output and refine it into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic transcripts that
are maximally readable. To this end, LDC has defined a SimpleMDE annotation task.
Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh"
and "um," discourse markers like "you know," asides and parentheticals, and editing
terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent
of the disfluency (or string of adjacent disfluencies) and interruption points are
tagged. Annotators further identify SUs (alternately semantic units, sense units,
syntactic units, slash units or sentence units); that is, units within the discourse
that function to express a complete thought or idea on the part of a speaker. As with
disfluency annotation, the goal of SU labeling is to improve transcript readability,
here by creating a transcript in which information is presented in small, structured,
coherent chunks rather than long turns or stories. There are four types of sentence-level
SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator
consistency, the annotation task also identifies a number of sub-sentence SU boundaries
(coordination and clausal SUs). The docs directory contains the complete set of SimpleMDE
annotation guidelines used to create this data. *Data Format* The data appears in
two formats. The AG Atlas (ag.xml) format represents the native annotation format,
and utilizes the Annotation Graph Library. This data is best explored using the LDC
MDE Toolkit, which is freely available at http://www.ldc.upenn.edu/Projects/MDE/Tools.
The data is also provided in the RTTM format developed by NIST to support the EARS Program.
The RTTM format labels each token in the reference transcript according to the properties
it displays: lexeme vs. non-lexeme; edit, filler, SU, etc. General information about the
EARS MDE Annotation effort, including
free annotation tools, annotation guidelines and additional information can be found
at LDC's EARS MDE Project Page.
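As a rough illustration of how token-level RTTM records might be read, the sketch below assumes the commonly documented nine-column layout (type, file, channel, start time, duration, orthography, subtype, speaker name, confidence), with `<NA>` marking an empty field and `;;` starting a comment line. That layout, the `RttmRecord` and `parse_rttm_line` names, and the sample line are assumptions for illustration; the documentation shipped with the corpus is the authority on the actual format.

```python
from typing import NamedTuple, Optional


class RttmRecord(NamedTuple):
    kind: str                # object type, e.g. LEXEME, FILLER, EDIT, SU
    file: str
    channel: str
    tbeg: Optional[float]    # start time in seconds
    tdur: Optional[float]    # duration in seconds
    ortho: Optional[str]     # orthography of the token, if any
    subtype: Optional[str]
    speaker: Optional[str]
    conf: Optional[float]


def _opt(value: str, cast=str):
    """Return None for the <NA> placeholder, otherwise the cast value."""
    return None if value == "<NA>" else cast(value)


def parse_rttm_line(line: str) -> Optional[RttmRecord]:
    """Parse one whitespace-delimited line; skip blanks and ';;' comments."""
    fields = line.split()
    if not fields or fields[0].startswith(";;"):
        return None
    kind, file_, chnl, tbeg, tdur, ortho, stype, name, conf = fields[:9]
    return RttmRecord(
        kind, file_, chnl,
        _opt(tbeg, float), _opt(tdur, float),
        _opt(ortho), _opt(stype), _opt(name), _opt(conf, float),
    )


# Made-up example line in the assumed layout (not taken from the corpus).
record = parse_rttm_line("LEXEME demo_file 1 3.21 0.18 uh fp spkr_A <NA>")
print(record)
```

Grouping the parsed records by `kind` would then give per-file counts of lexemes, fillers, edits and SUs.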
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633038
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 682-718-319-529-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST Meeting Pilot Corpus Transcripts and Metadata
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST Meeting Pilot Corpus Transcripts and Metadata was produced by Linguistic Data
Consortium (LDC), catalog number LDC2004T13 and ISBN 1-58563-303-8. This corpus contains
the full speech transcripts created by the Linguistic Data Consortium for the NIST
Automatic Meeting Recognition Project as well as a metadata database with useful information
about the meeting forums, topics, participants and recording conditions and equipment.
The corresponding speech files are available as the NIST Meeting Pilot Corpus Speech,
while the video files will be published later as NIST Meeting Pilot Corpus Video.
For more information, documentation, and updates made after the release of this corpus,
please consult the NIST project website for the corpus. *Data* The data for the NIST
Automatic Meeting Recognition Project was collected at the NIST Gaithersburg, MD Meeting
Data Collection Laboratory and includes 19 meetings (comprising about 15 hours of
data) recorded between November 2001 and December 2003. The full transcriptions included
in this release were created using a "quick" transcription procedure. There are approximately
151K words and 6K unique words. A variety of information about the subjects and recording
setup was manually recorded during the collection of the pilot corpus. This information was stored
in a relational database. A fully-updated online version of the database is available
from the NIST project website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stanford, Vincent M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tabassi, Elham
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Laprun, Christophe D.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pratz, Nicolas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lard, Jerome
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633046
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 874-058-423-080-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Proposition Bank I
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Proposition Bank I was produced by Linguistic Data Consortium (LDC), catalog number
LDC2004T14 and ISBN 1-58563-304-6. This is a semantic annotation of the Wall Street
Journal section of Treebank-2. More specifically, each verb occurring in the Treebank
has been treated as a semantic predicate and the surrounding text has been annotated
for arguments and adjuncts of the predicate. The verbs have also been tagged with
coarse grained senses and with inflectional information. This work was done in the
Computer and Information Sciences Department at the University of Pennsylvania. All
data is the result of double-blind, adjudicated annotation. *Data* There are two basic
components to Propbank:
* The Verb Lexicon. A frames file, consisting of one or more frame sets, has been
  created for each verb occurring in the Treebank. These files serve as a reference
  for the annotators and for users of the data. 3,324 such files have been created,
  totalling about 5.5 MB of uncompressed data.
* The Annotation. There are approximately 113,000 annotated verb tokens. These verb
  tokens include all those occurring in over one million words of the Wall Street
  Journal section of the Penn Treebank, excluding 'be' and auxiliary uses of 'do'
  and 'have.' There are annotations for over 3,200 unique verbs. These annotations
  are stored in a single file in standoff format, totalling ~9.6 MB of uncompressed data.
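To give a sense of what reading a standoff annotation can look like, here is a minimal sketch. It assumes, without confirmation from this record, that each line of the annotation file holds six leading fields (Treebank file, sentence index, terminal index of the verb, tagger, frame set ID, inflection string) followed by `pointer-LABEL` argument entries. The `parse_prop_line` function and the sample line are illustrative only; the corpus documentation defines the real layout.

```python
def parse_prop_line(line: str) -> dict:
    """Split one standoff annotation line into its assumed fields."""
    pieces = line.split()
    fileid, sentnum, wordnum, tagger, roleset, inflection = pieces[:6]
    args = []
    for piece in pieces[6:]:
        # Split at the first hyphen only, so labels such as ARGM-MOD stay whole.
        pointer, label = piece.split("-", 1)
        args.append((pointer, label))
    return {
        "file": fileid,
        "sentence": int(sentnum),
        "terminal": int(wordnum),
        "tagger": tagger,
        "roleset": roleset,
        "inflection": inflection,
        "args": args,
    }


# Illustrative line in the assumed layout.
example = "wsj/00/wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1"
parsed = parse_prop_line(example)
print(parsed["roleset"], parsed["args"])
```

Each pointer refers back to a node in the corresponding Treebank parse, which is why the annotations can be distributed separately from the Treebank text itself.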
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kingsbury, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Babko-Malaya, Olga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cotton, Scott
ADDED ENTRY--PERSONAL NAME
- Personal name:
Snyder, Benjamin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633054
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 451-626-470-363-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2000 Communicator Dialogue Act Tagged
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2000 Communicator Dialogue Act Tagged was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2004T15, ISBN 1-58563-305-4. This corpus is an addendum to the
2000 Communicator Evaluation corpus produced by the LDC in 2002. This addendum contains
annotations on the transcriptions of the system and user utterances as taken from
the logfiles of the 2000 Communicator Evaluation corpus. Dialogue Act annotations
are provided for system utterances in the dialogues. The dialogue act tags follow
the DATE (Dialogue Act Tagging for Evaluation) scheme. In addition, both system and
user utterances are tagged for named entities. For further description of the 2000
Communicator Evaluation corpus, please refer to the main publication from 2002 (LDC2002S56).
*Data* The complete Dialogue Act annotated corpus is available as a single XML text
file totalling approximately 16 MB. The total number of dialogues is 648. There are
314,223 words (tokens) and 1,403,985 unique words. Each dialogue is segmented into
system and user turns. The total number of turns for the entire corpus is 24,728 (13,013
system turns and 11,715 user turns). Except for one system, no utterance segmentation
was done within the turns in the logfiles. The number of utterances is therefore the
same as the number of turns. Utterance segmentation is carried out and reflected as
the dialogue act segmentation. The total number of tagged dialogue acts is 22,701
with 61 unique tags. There are a total of 275,938 words in the system utterances and
a total of 38,285 words in the user utterances. Dialogue Act tagging was done automatically
via pattern matching with human-labeled dialogue utterances used by the nine different
participating Communicator Systems. Named entity tagging also followed the same methodology.
*Sponsorship* This research was conducted using funding from the following grant number
and funding agency: DARPA - contract MDA972-99-3-0003.
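The counts quoted in this summary can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and the check_counts helper are hypothetical names, not part of the corpus or of any LDC tool. With the figures as printed, the turn and word subtotals add up, but the unique-word figure is larger than the token count, which a count of distinct word types cannot be, so that figure probably measures something other than distinct word types.

```python
# Figures copied from the summary above; the key names are illustrative.
reported = {
    "tokens": 314_223,
    "unique_words": 1_403_985,
    "turns": 24_728,
    "system_turns": 13_013,
    "user_turns": 11_715,
    "system_words": 275_938,
    "user_words": 38_285,
}


def check_counts(r):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if r["system_turns"] + r["user_turns"] != r["turns"]:
        problems.append("system + user turns do not equal total turns")
    if r["system_words"] + r["user_words"] != r["tokens"]:
        problems.append("system + user words do not equal total tokens")
    # The number of distinct word types can never exceed the number of tokens.
    if r["unique_words"] > r["tokens"]:
        problems.append("unique-word figure exceeds token count")
    return problems


problems = check_counts(reported)
for problem in problems:
    print(problem)
```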
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Dialogues.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Prasad, Rashmi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Marilyn
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633062
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 137-996-514-791-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 Communicator Dialogue Act Tagged
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2001 Communicator Dialogue Act Tagged was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2004T16, ISBN 1-58563-306-2. This corpus is an addendum to the
2001 Communicator Evaluation corpus produced by the LDC in 2003. This addendum contains
annotations on the transcriptions of the system and user utterances as taken from
the corrected logfiles of the 2001 Communicator Evaluation corpus. Missing or misaligned
time-stamps on turn/utterance boundaries were corrected by hand. Dialogue
Act annotations are provided for system utterances in the dialogues. The dialogue
act tags follow the DATE (Dialogue Act Tagging for Evaluation) scheme. In addition,
both system and user utterances are tagged for named entities. For further description
of the 2001 Communicator Evaluation corpus, please refer to the main publication from
2003 (LDC2003S01). *Data* The complete Dialogue Act annotated corpus is available
as a single XML text file totalling approximately 67 MB. The total number of dialogues
is 1,683. There are 1,151,330 words (tokens) and 5,343,286 unique words. Each dialogue
is segmented into system and user turns. The total number of turns for the entire
corpus is 78,718 (39,419 system turns and 39,299 user turns). Turns were further segmented
into utterances in the system logfiles. The total number of utterances is 89,666 (39,417
system utterances and 50,249 user utterances). There are a total of 1,048,311 words
in the system utterances and a total of 103,019 words in the user utterances. The
total number of tagged dialogue acts is 82,277 with 68 unique tags. Dialogue Act tagging
was done automatically using pattern matching with human-labeled dialogue utterances
used by the nine different participating Communicator Systems. Named entity tagging
also followed the same methodology. *Sponsorship* This research was conducted using
funding from the following grant number and funding agency: DARPA contract MDA972-99-3-0003.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistic analysis (Linguistics)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Prasad, Rashmi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Marilyn
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633070
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 443-183-109-992-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic News Translation Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic News Translation Text Part 1 was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2004T17, ISBN 1-58563-307-0. To support the development of automatic
machine translation systems, the LDC was sponsored to solicit English translations
for a single set of Arabic source materials. The source Arabic text was selected and
translated in different LDC projects during the time period of November 2002 to February
2004. A total of about 441K Arabic words were selected from three sources, namely
Xinhua, AFP, and An Nahar, and translation services were provided by eight translation
agencies who translated each Arabic news story once. The Xinhua and An Nahar stories
and their translations were created for TIDES Machine Translation, while the AFP stories
and their English translations were created for TIDES TDT. The development of all
these translations followed roughly the same guidelines and procedures. *Data* Three
sources of journalistic Arabic text were selected to provide the Arabic material:
  AFP News Service: 250 news stories, October 1998 - December 1998
  Xinhua News Service: 670 news stories, November 2001 - March 2002
  An Nahar: 606 news stories, October 2001 - December 2002
  (total: 1,526 stories)
The overall count of Arabic words by source is shown in the following table:
  Source      Arabic words
  AFP               44,193
  Xinhua            99,514
  An Nahar         297,533
  ------------------------
  Total            441,240
The Arabic data comprises about 441K words; the English translations comprise
approximately 581K words in total, with 25K unique
words. Each translation team was provided with translation guidelines. In accordance
with the guidelines, each translation team was asked to return the first five stories
for quality checking in each project. This was to ensure that each translation team
had indeed understood and was following the guidelines, and the translation quality
was acceptable. When deviations from the guidelines or quality issues were detected,
the LDC sent the translations back to the translation team. Subsequent translation
submissions were continuously monitored for conformance and quality. Once the full
set of translations was complete, a final pass of reformatting and validation was
carried out, to assure alignability of segments, and to convert the translated texts
into SGML format. An Arabic-English bilingual LDC employee went through all the source
data and English translations, and fixed any problems that had been found. For the
present release, the corpus content is organized into source and translation directories,
containing 1,526 files in source and 1,526 files in translation, one news story per
file.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bamba, Moussa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633100
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 233-597-996-883-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic English Parallel News Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains Arabic news stories and their English translations LDC collected
via Ummah Press Service from January 2001 to September 2004. It totals 8,439 story
pairs, 68,685 sentence pairs, 2M Arabic words and 2.5M English words. The corpus is
aligned at sentence level. All data files are SGML documents. Arabic and English
sample files are linked from the LDC catalog entry (https://catalog.ldc.upenn.edu/LDC2004T18).
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633143
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 100-086-600-941-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher English Training Speech Part 1 Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher English Training Speech Part 1 Transcripts represents the first half of a collection
of conversational telephone speech (CTS) that was created at LDC in 2003. It contains
time-aligned transcript data for 5,850 complete conversations, each lasting up to
10 minutes. In addition to the transcriptions, which are found under the trans directory,
there is a complete set of tables describing the speakers, the properties of the telephone
calls, and the set of topics that were used to initiate the conversations. The corresponding
speech files are contained in Fisher English Training Speech Part 1 Speech (LDC2004S13).
The Fisher telephone conversation collection protocol was created at LDC to address
a critical need of developers trying to build robust automatic speech recognition
(ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II
and the resulting corpora, have been adapted for ASR research but were in fact developed
for language and speaker identification respectively. Although the CALLHOME protocol
and corpora were developed to support ASR technology, they feature small numbers of
speakers making telephone calls of relatively long duration with narrow vocabulary
across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a large number of participants each call another participant,
whom they typically do not know, for a short period of time to discuss the assigned
topics. This maximizes inter-speaker variation and vocabulary breadth while also increasing
formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied
upon participant activity to drive the collection. Fisher is unique in being platform
driven rather than participant driven. Participants who wish to initiate a call may
do so; however, the collection platform initiates the majority of calls. Participants
need only answer their phones at the times they specified when registering for the
study. To encourage a broad range of vocabulary, Fisher participants are asked to
speak about an assigned topic chosen from a randomly generated list that changes every
24 hours. All participants that day will be assigned subjects from that list. Some
topics are inherited or refined from previous Switchboard studies while others were
developed specifically for the Fisher protocol. *Data* Overall, about 12% of the conversations
were transcribed at LDC, and the rest were transcribed by BBN and WordWave using a
significantly different approach to the task. A central goal in both sets was to maximize
the speed and economy of the transcription process. This in turn involved forgoing certain
aspects of mark-up detail and quality control that may have been common in previous,
smaller corpora. The LDC transcripts were based on automatic segmentation of the audio
data, to identify the utterance end-points on both channels of each conversation.
Given these time stamps, manual transcription was simply a matter of typing in the
words for each segment and doing a rudimentary spell-check. No attempt was made to
modify the segmentation boundaries manually, or to locate utterances that the segmenter
might have missed. Portions of speech where the transcriber could not be sure exactly
what was said were marked with double parentheses -- (( ... )) -- and the transcriber
could hazard a guess as to what was said, or leave the region between parentheses
blank. The LDC transcription process yields one plain-text transcript file per conversation,
in which the first two lines show the call-ID and the fact that the transcript was
developed at LDC. The remainder of the file contains one utterance per line (with
blank lines separating the utterances), with the start-time, end-time, speaker/channel-ID
and utterance text. Data collection and transcription were sponsored by DARPA and
the U.S. Department of Defense, as part of the EARS project for research and development
in automatic speech recognition.
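The LDC transcript layout described above lends itself to a small line-oriented parser. The following is a minimal Python sketch, not an LDC tool: the exact field layout ("<start> <end> <speaker>: <text>"), the sample header text, and the names Utterance, parse_transcript and uncertain_spans are assumptions made for illustration and should be checked against the corpus documentation before use.

```python
import re
from typing import List, NamedTuple, Tuple


class Utterance(NamedTuple):
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str   # speaker/channel ID
    text: str      # utterance text


# Assumed line layout: "<start> <end> <speaker>: <text>"
LINE_RE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\S+?):\s*(.*)$")
# Uncertain regions are marked with double parentheses and may be empty.
UNCERTAIN_RE = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def parse_transcript(text: str) -> Tuple[List[str], List[Utterance]]:
    """Split a transcript into its two header lines and its utterances."""
    lines = text.splitlines()
    header = lines[:2]
    utterances = []
    for line in lines[2:]:
        line = line.strip()
        if not line:
            continue  # blank lines only separate utterances
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        start, end, speaker, utt = match.groups()
        utterances.append(Utterance(float(start), float(end), speaker, utt))
    return header, utterances


def uncertain_spans(utt_text: str) -> List[str]:
    """Return the transcriber's guesses inside (( ... )); '' if left blank."""
    return UNCERTAIN_RE.findall(utt_text)


# Hypothetical sample in the assumed layout (not taken from the corpus).
SAMPLE = """# call-0001
# Transcribed at the LDC

0.00 1.52 A: hello

1.60 3.10 B: hi (( how are )) you

3.25 4.00 A: (( ))
"""

header, utts = parse_transcript(SAMPLE)
for u in utts:
    print(u.speaker, u.start, u.end, uncertain_spans(u.text))
```

Lines that do not fit the assumed pattern are skipped rather than raising an error, so transcripts produced under a different convention would need their own pattern.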
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kimball, Owen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633194
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 034-001-778-929-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Prague Arabic Dependency Treebank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Prague Arabic Dependency Treebank (PADT) consists of multi-level linguistic
annotations of Modern Standard Arabic and also provides a variety
of unique software implementations designed for general use in Natural Language Processing
(NLP). The PADT project can be summarized as an open-ended activity of the Center
for Computational Linguistics, the Institute of Formal and Applied Linguistics, and
the Institute of Comparative Linguistics, Charles University in Prague, consisting in
multi-level annotation of Arabic language resources in the light of the theory of
Functional Generative Description. The project is a younger sibling of the Prague Dependency
Treebank for Czech, and is maintained in cooperation with the Linguistic Data Consortium,
University of Pennsylvania, which releases non-annotated corpora of Arabic newswire and
develops an independent Penn Arabic Treebank. *Data* The corpus of PADT 1.0 consists
of morphologically and analytically annotated newswire texts of Modern Standard Arabic,
which originate from the Arabic Gigaword and the plain data of Penn Arabic Treebank,
Part 1 and Penn Arabic Treebank, Part 2. The PADT 1.0 distribution comprises over
113,500 tokens of data annotated analytically and provided with the disambiguated
morphological information. In addition, the release includes complete annotations
of MorphoTrees resulting in more than 148,000 tokens, 49,000 of which have received
the analytical processing. The contents are further divided into data sets as indicated
in the Table.
  Data Set  Tokens [A]  Tokens [M]  Tokens/Para  Tokens/Doc  Original Data Provider  News Period     Related Corpora
  AFP       13,000      N/A         34.6 [N/A]   260 [N/A]   Agence France Presse    July 2000       Penn ATB Part 1
  UMH       38,500      N/A         43.6 [N/A]   290 [N/A]   Ummah Press Service     Spring 2002     Penn ATB Part 2
  XIN       13,500      N/A         31.2 [N/A]   155 [N/A]   Xinhua News Agency      May 2003        Arabic Gigaword
  ALH       10,000      73,500      47.0 [47.8]  405 [405]   Al Hayat News Agency    September 2001  Arabic Gigaword
  ANN       12,500      25,500      60.3 [50.3]  740 [630]   An Nahar News Agency    November 2002   Arabic Gigaword
  XIA       26,500      49,500      29.7 [25.9]  235 [205]   Xinhua News Agency      May 2003        Arabic Gigaword
In the Table, the token columns give the number of syntactic units annotated [A] analytically
and [M] within MorphoTrees. The next columns give approximate ratios of tokens per paragraph
and tokens per document, with the bracketed value applying to the [M] annotation.
The selected documents may cover only a couple of days within the specified period of time.
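The totals quoted in the text can be recovered from the per-set figures in the Table. The short Python sketch below is only an illustration: the ROWS list and variable names are hypothetical, and the values are the rounded figures printed in the Table, so the sums are approximate in the same way.

```python
# Rows copied from the Table: (data set, [A] tokens, [M] tokens or None for N/A)
ROWS = [
    ("AFP", 13_000, None),
    ("UMH", 38_500, None),
    ("XIN", 13_500, None),
    ("ALH", 10_000, 73_500),
    ("ANN", 12_500, 25_500),
    ("XIA", 26_500, 49_500),
]

# All analytically annotated tokens (the text reports "over 113,500").
total_a = sum(a for _, a, _ in ROWS)
# All tokens annotated within MorphoTrees (the text reports "more than 148,000").
total_m = sum(m for _, _, m in ROWS if m is not None)
# Analytically annotated tokens in the sets that also have MorphoTrees
# annotation (the text reports 49,000).
a_within_m = sum(a for _, a, m in ROWS if m is not None)

print(total_a, total_m, a_within_m)
```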
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Cross-language information retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Smrz, Otakar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zemanek, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pajas, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Snaidauf, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Beska, Emanuel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kracmar, Jakub
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hassanova, Kamila
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2004 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633216
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2004T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 557-838-231-104-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Prague Czech-English Dependency Treebank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2004]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2004T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Prague Czech-English Dependency Treebank 1.0 (PCEDT) was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004T25, ISBN 1-58563-321-6. This corpus was
developed at the Center for Computational Linguistics in cooperation with the Institute
of Formal and Applied Linguistics. PCEDT 1.0 is a corpus of Czech-English parallel
resources suitable for experiments in machine translation, with a special emphasis
on dependency-based (structural) translation (with evaluation data provided for Czech-to-English
systems). *Data* The core part of PCEDT 1.0 is a Czech translation of 21,600 English
sentences from the Wall Street Journal, which are part of the Penn Treebank corpus.
Sentences of the Czech translation were automatically morphologically annotated and
parsed into two levels (analytical and tectogrammatical) of dependency structures
introduced in the theory of Functional Generative Description and closely related
to the Prague Dependency Treebank project. The original English sentences were transformed
from the Penn Treebank phrase-structure trees into dependency representations. A held-out
(development and evaluation) set of 515 sentence pairs was selected and manually annotated
at the tectogrammatical level in both Czech and English; for the purposes of quantitative
evaluation, this set has been retranslated from Czech into English by four different
translation companies. PCEDT 1.0 also contains a parallel Czech-English corpus of
plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences,
and a large monolingual corpus of Czech (2.4 million sentences). The included Czech-English
translation dictionary consists of 46,150 translation pairs in its lemmatized version
and 496,673 pairs of word forms, where for each entry-translation pair all corresponding
word form pairs have been generated. Also included is an English-Czech dictionary
provided by Milan Svoboda under GNU/FDL license; this dictionary contains multi-word
translations in 115,929 translation pairs. The next version of PCEDT is intended to include
a translation of the whole Wall Street Journal part of the Penn Treebank, as well as reference
retranslations for Czech. As the manual for tectogrammatical annotation of English
is developed, the proportion of manually annotated data will increase. *Sponsorship*
PCEDT 1.0 has been supported by the following grants and projects: * Ministry of Education
of the Czech Republic project No. LN00A063 (Center for Computational Linguistics)
* National Science Foundation grant No. IIS-0121285
LANGUAGE NOTE
- Language note:
Content in Czech and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Curin, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cmejrek, Martin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Havelka, Jiří
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kubon, Vladislav
ADDED ENTRY--PERSONAL NAME
- Personal name:
Žabokrtský, Zdeněk
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2004T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u man d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633372
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 592-356-503-307-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
man
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
mxx
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Mawukakan Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Mawukakan Lexicon is the first publication of an ongoing project at the Linguistic
Data Consortium (LDC) aiming to build an electronic dictionary of three Mandekan [Eastern
Manding languages of the Mande Group of the Niger-Congo family] languages. The other
variants of Mandekan involved are the Bambara or Bamanankan [Mali] and the Maninka
or Maninkakan [Guinea-Conakry]. The lack of a written tradition makes such a dictionary
project extremely important. Our expectation is that once this initial goal is reached,
it will become easier to extend the dictionary to all the other varieties of Mandekan.
LDC released a Maninkakan Lexicon (LDC2013L01) in 2013 and a Bamanankan Lexicon (LDC2016L01)
in 2016. For the dictionary of a small language like Mawukakan (fewer than half
a million speakers) to be most useful, it has to combine the linguistic component
with a cultural component. The fact that the Mawukakan-English lexicon is coupled
with a Mawukakan-French one makes this project all the more important, given that
Mawukakan speakers live mostly in the francophone area of West Africa. The project
consists of collecting the largest possible amount of data on Mandekan
and the Manding culture and making it available electronically to the research community.
To pursue the objective defined above, the concept of dictionary-making adopted is
the one being developed at the LDC. It is a revolutionary approach to the lexicology
of languages which do not have a writing tradition. The originality of this project
resides in the fact that it extends the scope of the lexical database beyond that
of a simple list of lexical entries. It proposes that the database include, in addition
to the lexicon itself, anything that can contribute to a better knowledge of the language
and the culture of the Manding people. That means a collection of audio and video
recordings, as well as all of the written materials available on the language and
the culture. Adopting this concept makes it easier to preserve small languages
like Mawukakan from a speedy death. In fact, apart from creating the best conditions
for the popularization and standardization of the writing systems of the languages
concerned, the new approach to lexicology can significantly reduce
the cost of research on those languages and cultures. Creating the largest electronic
database possible on Mandekan and the Manding culture is therefore more than appropriate. The availability
of such databases will help expose all aspects of that language
and its culture. That can help popularize the writing system of the language and trigger
more interest in its study. Access to such a reference can only positively affect
research on the language and culture concerned. Because of its electronic nature,
updating the database will be an easy and ongoing exercise, driven by the feedback
received. That will also reduce to a strict minimum the number
of field trips that researchers interested in Mandekan need to take. Today's technological
breakthroughs make such objectives attainable. The powerful computers
and modern dictionary-making tools now available have transformed lexicology into
a more exciting enterprise than before. As for the Mawukakan lexicon we are making
available here, anyone with minimal training in linguistics should be capable of
using it. But for the fullest and most efficient use of the global database which
will result from this project, an introduction to the linguistic disciplines of Phonetics,
Phonology and Syntax will be most helpful. Our hope is that the final product
will be usable by linguists as well as all other researchers who are interested in
the language and the culture of the Mandenkas. This explains, in part, our choice
of the International Phonetic Alphabet [IPA] as the main transcription system for
Mawukakan. IPA is the least costly system for transcribing African languages without
a writing tradition, and for which an alphabet was defined only at the end of the
1950s, a period in which most of the colonies in West Africa became independent. *Data*
Both the Toolbox and the XML versions of this dictionary use the Unicode (UTF-8) encoding.
The Doulos SIL Unicode font works well, as do a number of other fonts displaying ASCII,
Latin-1 (U+00A0 through U+00FF), Latin Extended-A (U+0100 through U+017F), Latin Extended-B
(U+0180 through U+024F), IPA Extensions (U+0250 through U+02AF) and Combining Diacritical
Marks (U+0300 through U+036F). *Acknowledgement* Meghan Glenn served as an editor for
the French and English parts of this Lexicon.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mahou, French, and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Mandingo language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Mandingo language
- Form subdivision:
Dictionaries.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bamba, Moussa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633259
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 852-500-494-190-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic CTS Levantine Fisher Training Data Set 3, Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic CTS Levantine Fisher Training Data Set 3, Speech consists of 322 conversations,
representing a total of about 50 hours of Levantine Arabic speech. The corresponding
human-annotated transcripts are contained in Arabic CTS Levantine Fisher Training
Data Set 3, Transcripts (LDC2005T03). The Fisher telephone conversation collection
protocol was created at LDC to address a critical need of developers trying to build
robust automatic speech recognition (ASR) systems. Previous collection protocols,
such as CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted
for ASR research but were in fact developed for language and speaker identification
respectively. Although the CALLHOME protocol and corpora were developed to support
ASR technology, they feature small numbers of speakers making telephone calls of relatively
long duration with narrow vocabulary across the collection. CALLHOME conversations
were challengingly natural and intimate. Under the Fisher protocol, a very large number
of participants each made a few calls of short duration speaking to other participants,
whom they typically did not know, about assigned topics. This maximized inter-speaker
variation and vocabulary breadth although it also increased formality. Previous protocols
such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive
the collection. Fisher was unique in being platform driven rather than participant
driven. Participants who wished to initiate a call did so; however, the collection
platform initiated the majority of calls. Participants simply answered their phones
at the times they specified when registering for the study. To encourage a broad range
of vocabulary, Fisher participants were asked to speak about an assigned topic chosen
from a randomly generated list that changed every 24 hours. All participants that
day were assigned subjects from that list. Some topics were inherited or refined from
previous Switchboard studies while others were developed specifically for the Fisher
protocol.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632961
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 500-300-564-790-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts consists of transcribed,
spontaneous speech recorded from subjects speaking Levantine colloquial Arabic. Levantine
Arabic is the dialect of Arabic spoken in Lebanon, Jordan, Syria, and Palestine. It
is significantly different from Modern Standard Arabic. It is a spoken rather than
a written language, and includes different words and pronunciations from Modern Standard
Arabic. The corpus was developed with funding from the Defense Advanced Research Projects
Agency (DARPA), as part of the Babylon program. The Babylon program was intended to
advance the state of the art in speech-to-speech translation systems by creating new
technology and by developing systems for field use. BBN was funded under Babylon to
develop a limited English/Arabic refugee/medical speech translation system for a handheld
computer, and it collected this corpus as part of its work. The corpus may be useful
for speech recognition in Levantine colloquial Arabic, including for speech translation
and spoken dialog systems.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
BBN Technologies (with American University of Beirut a subcontractor)
ADDED ENTRY--PERSONAL NAME
- Personal name:
Makhoul, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zawaydeh, Bushra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Choi, Frederick
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stallard, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633380
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 381-876-213-869-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT4 Multilingual Broadcast News Speech Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TDT4 Multilingual Broadcast News Speech Corpus was developed by the Linguistic Data
Consortium (LDC) with support from the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) Program. This release contains the complete set of American
English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in
the 2002 and 2003 Topic Detection and Tracking technology evaluations. The transcripts
corresponding to the audio contained in this release, along with newswire data and
topic relevance annotations, can be found in LDC Publication LDC2005T16, TDT4 Multilingual
Text and Annotations. Topic Detection and Tracking (TDT) refers to automatic techniques
for finding topically related material in streams of data such as newswire and broadcast
news. Evaluation tasks in 2002 and 2003 included the segmentation of a news source
into stories, the tracking of known topics, the detection of unknown topics, the detection
of initial stories on unknown topics, and the detection of pairs of stories on the
same topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633356
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 050-970-085-362-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher English Training Part 2, Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher English Training Part 2, Speech represents the second half of a collection of
conversational telephone speech (CTS) that was created at the LDC during 2003. It
contains 5,849 audio files, each one containing a full conversation of up to ten minutes.
Additional information regarding the speakers involved, and types of telephones used,
can be found in the companion text corpus of transcripts, Fisher English Training
Part 2, Transcripts (LDC2005T19). The Fisher telephone conversation collection protocol
was created at LDC to address a critical need of developers trying to build robust
automatic speech recognition (ASR) systems. Previous collection protocols, such as
CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted for ASR
research but were in fact developed for language and speaker identification respectively.
Although the CALLHOME protocol and corpora were developed to support ASR technology,
they feature small numbers of speakers making telephone calls of relatively long duration
with narrow vocabulary across the collection. CALLHOME conversations are challengingly
natural and intimate. Under the Fisher protocol, a very large number of participants
each make a few calls of short duration speaking to other participants, whom they
typically do not know, about assigned topics. This maximizes inter-speaker variation
and vocabulary breadth while also increasing the formality. Previous protocols such
as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive
the collection. Fisher is unique in being platform driven rather than participant
driven. Participants who wish to initiate a call may do so; however, the collection
platform initiates the majority of calls. Participants need only answer their phones
at the times they specified when registering for the study. To encourage a broad range
of vocabulary, Fisher participants are asked to speak on an assigned topic, which is
selected at random from a list that changes every 24 hours, and which is assigned
to all subjects paired on that day. Some topics are inherited or refined from previous
Switchboard studies while others were developed specifically for the Fisher protocol.
*Data* The first half of the collection (Fisher English Training Speech Part 1) was
released by the LDC in 2004 (LDC2004S13 for speech data, LDC2004T19 for transcripts).
Taken as a whole, the two parts comprise 11,699 recorded telephone conversations.
The individual audio files are presented in NIST SPHERE format, and contain two-channel
mu-law sample data; "shorten" compression has been applied to all files. Data collection
and transcription were sponsored by DARPA and the U.S. Department of Defense, as part
of the EARS project for research and development in automatic speech recognition.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kimball, Owen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633232
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 426-628-131-806-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 5.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Treebank 5.0 was produced by the Linguistic Data Consortium (LDC) under catalog number
LDC2005T01 and ISBN 1-58563-323-2. The Penn Chinese Treebank is an ongoing project
that started in the summer of 1998. The goal of the project is to create a 500,000-word
corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published
in 2000, and it was later corrected and released in 2001 as Chinese Treebank 2.0.
Another updated version was released in 2004 as Chinese Treebank 4.0. More information
about the project is available on the Penn Chinese Treebank website. The content used
in this corpus comes from the following newswire sources: 698 articles from Xinhua (1994-1998);
55 articles from the Information Services Department of HKSAR (1997); and 132 articles from Sinorama
magazine, Taiwan (1996-1998 & 2000-2001). *Data* Chinese Treebank 5.0 contains 507,222
words, 824,983 Hanzi, 18,782 sentences, and 890 data files. All files are GB encoded.
The format of Chinese Treebank 5.0 is the same as the Penn English Treebank. All files
have been annotated at least twice. The first pass was done by one annotator, and
the resulting files were checked by a second annotator (second pass). Some files were
also double-blind annotated and then adjudicated to create gold standard files. The
corpus provides four versions of files: bracketed, raw, segmented and postagged. The
raw, segmented and postagged versions are generated from the bracketed version and
so do not reflect the previous annotation stages. The bracketed files are sequentially
named as follows: chtb_nnnn.fid, where nnnn is a sequential file number.
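As a rough illustration of how the segmented and postagged versions can be derived from a bracketed tree, the following Python sketch parses one Penn Treebank-style bracketing and reads the words and tags off its leaves. Several details are assumptions and are not stated in this record: that every tree has a labeled root, that `-NONE-` marks an empty element, that postagged output takes the form `word_TAG`, and that `nnnn` is a zero-padded four-digit number. The sample tree is invented for illustration and is not taken from the corpus.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                    # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        pos += 1                    # consume ")"
        return (label, children)

    return node()


def tagged_words(tree):
    """Collect (word, tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Assumption: "-NONE-" marks an empty element with no surface word.
        return [] if label == "-NONE-" else [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


def bracketed_name(n):
    # Assumption: "nnnn" means the file number zero-padded to four digits.
    return f"chtb_{n:04d}.fid"


# Illustrative tree, not taken from the corpus.
tree = parse_bracketed("(IP (NP (NN 经济)) (VP (VV 发展)))")
pairs = tagged_words(tree)
raw = "".join(word for word, _ in pairs)
segmented = " ".join(word for word, _ in pairs)
postagged = " ".join(f"{word}_{tag}" for word, tag in pairs)
```

Because all three derived strings come from the same list of leaves, they cannot disagree with the bracketing, which matches the note that the derived versions do not reflect earlier annotation stages.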
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Tsan-Kuang
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633305
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 692-072-445-089-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
To support the development of data-driven approaches to natural language processing
(NLP), human language technologies, automatic content extraction (topic extraction
and/or grammar extraction), cross-lingual information retrieval, information detection,
and other forms of linguistic research on Modern Standard Arabic in general, the LDC
was sponsored to develop an Arabic Treebank of 1,000,000 words. This corpus is a re-release
of part one of that project, with the addition in Version 3.0 of improved morphological/part-of-speech
annotation (including full vocalization and case endings). *Data* The project targets
the description of a written Modern Standard Arabic corpus from the Agence France
Presse (AFP) newswire archives for July-November 2000 (files dated 2000/7/15 to 2000/11/15).
This corpus includes 734 stories representing 145,386 words (166,068 tokens after
clitic segmentation in the Treebank; the number of Arabic tokens is 123,796). For
this work, annotators must be native speakers of Arabic, and they must understand
enough linguistics to check morphosyntactic analysis and build syntactic structures.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633267
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 528-410-099-660-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic CTS Levantine Fisher Training Data Set 3 Transcripts provides the transcription
for the speech contained in Arabic CTS Levantine Fisher Training Data Set 3, Speech
(LDC2005S07). This training speech release consists of 322 conversations, representing
a total of approximately 50 hours of Levantine Arabic speech. The Fisher telephone
conversation collection protocol was created at LDC to address a critical need of
developers trying to build robust automatic speech recognition (ASR) systems. Previous
collection protocols, such as CALLFRIEND and Switchboard-II and the resulting corpora,
have been adapted for ASR research but were in fact developed for language and speaker
identification respectively. Although the CALLHOME protocol and corpora were developed
to support ASR technology, they feature small numbers of speakers making telephone
calls of relatively long duration with narrow vocabulary across the collection. CALLHOME
conversations are challengingly natural and intimate. Under the Fisher protocol, a
very large number of participants each make a few calls of short duration speaking
to other participants, whom they typically do not know, about assigned topics. This
maximizes inter-speaker variation and vocabulary breadth although it also increases
formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied
upon participant activity to drive the collection. Fisher is unique in being platform
driven rather than participant driven. Participants who wish to initiate a call may
do so; however, the collection platform initiates the majority of calls. Participants
need only answer their phones at the times they specified when registering for the
study. To encourage a broad range of vocabulary, Fisher participants are asked to
speak on an assigned topic which is selected at random from a list, which changes
every 24 hours and which is assigned to all subjects paired on that day. Some topics
are inherited or refined from previous Switchboard studies while others were developed
specifically for the Fisher protocol.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633283
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 136-463-995-609-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Arabic (MTA) Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Arabic (MTA) Part 2, Linguistic Data Consortium (LDC) catalog
number LDC2005T05 and ISBN 1-58563-328-3, was produced by LDC. To support the development
of automatic means for evaluating translation quality, LDC was sponsored to solicit
four sets of human translations for a single set of Arabic source materials. LDC was
also asked to produce translations from various commercial off-the-shelf (COTS) systems,
including commercial Machine Translation (MT) systems as well as MT systems available
on the Internet. This corpus contains two sets of COTS outputs and one output set
from a TIDES 2003 MT Evaluation participant, which is representative of state-of-the-art
research systems. To determine whether automatic evaluation systems such as BLEU track
human assessment, LDC also performed human assessment on the two COTS outputs and
the TIDES research system. The corpus includes the assessment results for one of the
two COTS systems, the assessment result for the TIDES research system, and the specifications
used for conducting the assessments. *Source Data Selection:* Xinhua News Service
(Xinhua): 50 news stories; Agence France Presse (AFP): 50 news stories (total: 100
stories). There are 100 source files and 700 translation files. All source data were
drawn from January and February 2003 collection of Xinhua Arabic data and AFP Arabic
data. The story selection from the two newswire collections was controlled by story
length: all selected stories contain between 700 and 1,500 Arabic characters. The
overall count of Arabic words (excluding markup), by source, is as follows:
AFP 7,528; Xinhua 7,551; total 15,079.
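The file and word counts above can be cross-checked with a few lines of Python. This is only a consistency sketch: it assumes that each of the seven output sets (four human, two COTS, one research system) contributes exactly one translation file per source story, which the record does not state explicitly.

```python
# Counts taken from the summary above.
source_files = 100
human_sets = 4        # human translation sets
cots_sets = 2         # commercial off-the-shelf system outputs
research_sets = 1     # TIDES 2003 MT Evaluation participant output

# Assumption: one translation file per source story per output set.
translation_files = (human_sets + cots_sets + research_sets) * source_files

arabic_words = {"AFP": 7528, "Xinhua": 7551}
total_words = sum(arabic_words.values())
```

Under that assumption the arithmetic reproduces the 700 translation files and the 15,079-word total given in the summary.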
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633291
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 008-710-816-829-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese News Translation Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese News Translation Text Part 1, Linguistic Data Consortium (LDC) catalog number
LDC2005T06 and ISBN 1-58563-329-1, was produced by LDC. To support the development of automatic
machine translation systems, the LDC was sponsored to solicit English translations
for a single set of Chinese source materials. The source Chinese text and its English
translations were selected and translated in different LDC projects during the time
period of February 2003 to January 2005. A total of about 474K Chinese characters
were selected from two sources, namely Xinhua and AFP, and translation services were
provided by seven translation agencies. Each Chinese news story was translated once.
All stories and their translations were created for TIDES Machine Translation as training
data, following roughly the same guidelines and procedures.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633313
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 357-991-519-054-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE Time Normalization (TERN) 2004 English Training Data v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the ACE Time Normalization (TERN) 2004 English
Training Data v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T07 and
ISBN 1-58563-331-3. This release contains the English training data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by
the Automatic Content Extraction (ACE) program. The evaluation was held in August
2004, followed by a workshop in September 2004. Evaluation participants received this data
for training purposes, and it is now being released for general use. The annotation
specifications for this corpus were developed under DARPA's Translingual Information
Detection Extraction and Summarization (TIDES) program, with continuing support from
ACE. The purpose of this corpus and the TERN evaluation is to advance the state of
the art in the automatic recognition and normalization of natural language temporal
expressions. In most language contexts such expressions are indexical. For example,
with "Monday," "last week," or "three months starting October 1," one must know the
narrative reference time in order to pinpoint the time interval being conveyed by
the expression. In addition, for data exchange purposes, it is essential that the
identified interval be rendered according to an established standard, i.e., normalized.
Accurate identification and normalization of temporal expressions is in turn essential
for the temporal reasoning being demanded by advanced NLP applications such as question
answering, information extraction, and summarization.
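To make the idea of indexical expressions concrete, the Python sketch below resolves two of the example phrases against a narrative reference date and renders the result in ISO 8601 notation. It is a toy illustration, not the annotation scheme used in the corpus: the function names, the reference date, the choice of ISO 8601 as the target standard, and the reading of "Monday" as the most recent Monday are all assumptions made here.

```python
from datetime import date, timedelta


def resolve(expression, reference):
    """Return the (start, end) dates, inclusive, that an expression denotes."""
    monday_this_week = reference - timedelta(days=reference.weekday())
    if expression == "last week":
        start = monday_this_week - timedelta(days=7)
        return start, start + timedelta(days=6)
    if expression == "Monday":
        # Assumption: read "Monday" as the most recent Monday on or before
        # the reference date; real text may equally mean the coming Monday.
        return monday_this_week, monday_this_week
    raise ValueError(f"unsupported expression: {expression!r}")


def iso_week(day):
    """Render the ISO 8601 week that contains `day` as year-Www."""
    year, week = day.isocalendar()[:2]
    return f"{year}-W{week:02d}"


reference = date(2004, 8, 18)   # hypothetical narrative reference time
start, end = resolve("last week", reference)
normalized_interval = f"{start.isoformat()}/{end.isoformat()}"
normalized_week = iso_week(start)
monday, _ = resolve("Monday", reference)
```

The same phrase yields a different interval for every reference date, which is why recognition alone is not enough and a normalization step is needed before the values can be exchanged or reasoned over.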
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ferro, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gerber, Laurie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hitzeman, Janet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lima, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633208
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 983-656-398-539-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Discourse Graphbank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Developed as Florian Wolf's Ph.D. thesis, the Discourse Graphbank aimed to define a descriptively
adequate data structure for representing discourse coherence structures. This project
also investigated the impact of discourse coherence structures on other linguistic
processes and natural language applications (e.g., anaphor resolution, summarization,
information retrieval), and developed and tested discourse parsing algorithms. *Data*
The data consists of 135 texts from AP Newswire and Wall Street Journal, annotated
with coherence relations. The source was UPenn TIPSTER.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Cohesion (Linguistics)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Discourse analysis
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information storage and retrieval systems
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Anaphora (Linguistics)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Parsing (Computer grammar)
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wolf, Florian
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gibson, Edward
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Knight, Meredith
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633739
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T29
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 721-717-066-331-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HARD 2004 Topics and Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T29
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
HARD 2004 Topics and Annotations contains topics and annotations (clarification forms,
responses and relevance assessments) for the 2004 TREC HARD (High Accuracy Retrieval
from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference
(TREC), with the objective of achieving high accuracy retrieval from documents by
leveraging additional information about the searcher and/or the search context, through
techniques like passage retrieval and the use of targeted interaction with the searcher.
The current corpus was previously distributed to HARD Participants as LDC2004E42 and
LDC2005E17. The source data that corresponds to this release is distributed as LDC2005T28,
HARD 2004 Text. This corpus was created with support from the DARPA TIDES Program
and LDC. *Data* Three major annotation tasks are represented in this release: Topic
Creation, Clarification Form Responses, and Relevance Assessment. Topics include a
short title, query plus context, and a number of limiting parameters known as metadata
which include targeted geographical region, target data domain or genre, and level
of searcher expertise. Clarification Forms are brief HTML questionnaires system developers
submitted to LDC searchers to glean additional information about information needs
directly from the topic creators. Relevance assessment consisted of adjudication of
pooled system responses, and included document-level judgments for all topics, and
passage-level relevance judgments for a subset of topics. The release is divided into
training and evaluation resources. The training set comprises twenty-one topics and
100 document-level relevance judgments per topic. The evaluation set contains fifty
topics, clarification forms and responses, document-level relevance assessment for
all topics and passage-level judgments for half of the topics. HARD participants received
the reference data over the course of the evaluation cycle in stages: (0) training
topics, (1) evaluation topic descriptions without metadata, (2) clarification form
responses, (3) topic descriptions with metadata, and (4) relevance assessments.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Metadatabases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T29
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636606
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 030-491-638-667-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSC Deceptive Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSC Deceptive Speech was developed by Columbia University, SRI International and the University
of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers
of Standard American English (16 male, 16 female) recruited from the Columbia University
student population and the community. The purpose of the study was to distinguish
deceptive speech from non-deceptive speech using machine learning techniques on extracted
features from the corpus. The participants were told that they were participating
in a communication experiment which sought to identify people who fit the profile
of the top entrepreneurs in America. To this end, the participants performed tasks
and answered questions in six areas. They were later told that they had received low
scores in some of those areas and did not fit the profile. The subjects then participated
in an interview where they were told to convince the interviewer that they had actually
achieved high scores in all areas and that they did indeed fit the profile. The task
of the interviewer was to determine how he thought the subjects had actually performed,
and he was allowed to ask them any questions other than those that were part of the
performed tasks. For each question from the interviewer, subjects were asked to indicate
whether the reply was true or contained any false information by pressing one of two
pedals hidden from the interviewer under a table. *Data* Interviews were conducted
in a double-walled sound booth and recorded to digital audio tape on two channels
using Crown CM311A Differoid headworn close-talking microphones, then downsampled
to 16 kHz before processing. The interviews were orthographically transcribed by hand
using the NIST EARS transcription guidelines. Labels for local lies were obtained
automatically from the pedal-press data and hand-corrected for alignment, and labels
for global lies were annotated during transcription based on the known scores of the
subjects versus their reported scores. The orthographic transcription was force-aligned
using the SRI telephone speech recognizer adapted for full-bandwidth recordings. There
are several segmentations associated with the corpus: the implicit segmentation of
the pedal presses, semi-automatically derived sentence-like units (EARS SLASH-UNITS
or SUs) which were hand-labeled, intonational phrase units and the units corresponding
to each topic of the interview. Transcript files are in .trs format and audio files
are .wav presented in flac-compressed form for this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Columbia University
ADDED ENTRY--PERSONAL NAME
- Personal name:
SRI International
ADDED ENTRY--PERSONAL NAME
- Personal name:
University of Colorado Boulder
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636614
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 860-172-183-494-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 8.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and
parsed text from Chinese newswire, government documents, magazine articles, various
broadcast news and broadcast conversation programs, web newsgroups and weblogs. The
Chinese Treebank project began at the University of Pennsylvania in 1998, continued
at the University of Colorado and then moved to Brandeis University. The project goal
is to provide a large, part-of-speech tagged and fully bracketed Chinese language
corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically
annotated words from Xinhua News Agency newswire. It was later corrected and released
in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000
words. LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing
roughly 400,000 words, in 2004. A year later, LDC published the 500,000 word Chinese
Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted
of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated
newswire data, broadcast material and web text to the approximate total of one million
words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles
and government documents. *Data* There are 3,007 text files in this release, containing
71,369 sentences, 1,620,561 words and 2,589,848 characters (hanzi or foreign). The data
is provided in UTF-8 encoding, and the annotation has Penn Treebank-style labeled
brackets. Details of the annotation standard can be found in the segmentation, POS-tagging
and bracketing guidelines included in this release. The data is provided in four
formats: raw text, word-segmented, POS-tagged and syntactically bracketed.
All files were automatically verified and manually checked.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhang, Xiuhong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633348
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 789-870-824-708-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2004 Multilingual Training Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the ACE 2004 Multilingual Training Corpus, Linguistic
Data Consortium (LDC) catalog number LDC2005T09 and ISBN 1-58563-334-8. This publication
contains the complete set of English, Arabic and Chinese training data for the 2004
Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data
of various types annotated for entities and relations and was created by Linguistic
Data Consortium with support from the ACE Program, with additional assistance from
the DARPA TIDES (Translingual Information Detection, Extraction and Summarization)
Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants
in the 2004 ACE evaluation. The objective of the ACE program is to develop automatic
content extraction technology to support automatic processing of human language in
text form. In September 2004, sites were evaluated on system performance in six areas:
Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference,
Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR
given reference entities. All tasks were evaluated in three languages: English, Chinese
and Arabic. The current publication consists of the official training data for these
evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported
by the ACE Time Normalization (TERN) 2004 English Training Data Corpus (LDC2005T07).
The TERN corpus source data largely overlaps with the English source data contained
in the current release. A complete description of the ACE 2004 Evaluation can be found
on the ACE Program website maintained by the National Institute of Standards and Technology
(NIST): http://www.nist.gov/speech/tests/ace/ For more information about linguistic
resources for the ACE program, including annotation guidelines, task definitions,
free annotation tools and other documentation, please visit LDC's ACE website.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mitchell, Alexis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Ramez
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 629-451-208-314-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese English News Magazine Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Chinese English News Magazine Parallel Text,
Linguistic Data Consortium (LDC) catalog number LDC2005T10 and ISBN 1-58563-333-X.
This corpus contains Chinese news stories and their English translations that LDC collected
via Sinorama Magazine, Taiwan, from 1976 to 2004. It totals 6,366 story pairs, 365,568
sentence pairs, 20M Chinese characters and 9M English words. The corpus is aligned
at the sentence level. *Data* Sinorama Magazine is published monthly in several languages,
including Chinese, English and Japanese. LDC received its 1976 to 2000 publications on
a single CD, and its 2001 to 2004 publications via Sinorama's website. The Sinorama
Chinese text was encoded in Big5. The data came story-aligned but lacked sentence-level
alignment. The sentence alignment was done at the LDC using Champollion v 1.1.
The final data is placed in the data directory, which contains subdirectories for Chinese
documents, English documents, and the sentence-level alignment, identified as "Chinese,"
"English," and "alignment." The English and Chinese files may contain one or more
documents, with each document formatted in SGML as follows: [English or Chinese text]
[English or Chinese text] [English or Chinese text] ... Notes: * the and tags are
always assigned sequential numeric IDs, starting at one. * the tags are always placed
on the same line with their contents, and are always separated from the contents by
a space. * if an English file and a Chinese file share the same file name, they contain
the same documents. * all Chinese text is encoded in Big5. Each alignment file contains
the sentence level alignment of multiple documents, each being formatted in SGML as
follows: ... Notes: * the docid in an English file, its Chinese translation and the
ALIGNMENT are the same. * EnglishSegId and ChineseSegId may have none, one, or more
than one segment IDs.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 274-788-133-216-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Gigaword Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Gigaword Second Edition was produced by Linguistic Data Consortium (LDC), catalog
number LDC2005T12 and ISBN 1-58563-350-X. The English Gigaword corpus is a comprehensive
archive of newswire text data in English that has been acquired over several years
by the LDC. This is the second edition of the English Gigaword corpus. This edition
includes all of the contents in the first edition of the English Gigaword corpus (LDC2003T05)
as well as new data from July 2002 through December 2004. Also, a new newswire source (the
Central News Agency of Taiwan, English Service) has been added in this edition. The
five distinct international sources of English newswire included in this release are
the following: Agence France-Presse, English Service (afp_eng); Associated Press Worldstream,
English Service (apw_eng); Central News Agency of Taiwan, English Service (cna_eng);
The New York Times Newswire Service (nyt_eng); and The Xinhua News Agency, English Service
(xin_eng). *What's New In The Second Edition* * New newswire data contents from July
2002 to December 2004 have been added for all four newswire sources that were
represented in the first edition. * A new source, the Central News Agency of Taiwan
English Service (CNA_ENG), has been added. * We have adopted a new naming scheme for
filenames and DOC IDs. The new naming scheme represents the source names in a three-letter
code and the language name in a three-letter code. * Minor formatting improvements
(mostly line-wrapping) have been made to some of the data contents originally published
in the first edition.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636630
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 462-157-606-044-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The ARRAU Corpus of Anaphoric Information
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The ARRAU (Anaphora Resolution and Underspecification) Corpus of Anaphoric Information
was developed by the University of Essex and the University of Trento. It contains
annotations of multi-genre English texts for anaphoric relations with information
about agreement and explicit representation of multiple antecedents for ambiguous
anaphoric expressions and discourse antecedents for expressions which refer to abstract
entities such as events, actions and plans. The source texts in this release include
task-oriented dialogues from the TRAINS-91 and TRAINS-93 corpora (the latter released
through LDC, TRAINS Spoken Dialog Corpus LDC95S25), narratives from the English Pear
Stories (a collection of narratives by subjects who watched a film and then recounted
its contents), articles from the Wall Street Journal portions of the Penn Treebank
(Treebank-2 LDC95T7) and the RST Discourse Treebank LDC2002T07, and the Vieira/Poesio
Corpus which consists of training and test files from Treebank-2 and RST Discourse
Treebank. *Data* The texts were annotated using the ARRAU guidelines which treat all
noun phrases (NPs) as markables. Different semantic roles are recognized by distinguishing
between referring expressions (that update or refer to a discourse model), and non-referring
ones (including expletives, predicative expressions, quantifiers, and coordination).
A variety of linguistic features were also annotated, including morphosyntactic agreement,
grammatical function, semantic type (person, animate, concrete, action, time, other
abstract) and genericity. The annotation was carried out using the MMAX2 annotation
tool which allows text units to be marked at different levels. The files in MMAX format
have been organized so that they can be visualized using the MMAX2 tool or directly
used as input/output for the BART toolkit which performs automatic coreference resolution
including all necessary preprocessing steps.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Poesio, Massimo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Artstein, Ron
ADDED ENTRY--PERSONAL NAME
- Personal name:
Uryupina, Olga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rodriguez, Kepa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Delogu, Francesca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bristot, Antonella
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hitzeman, Janet
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633402
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 181-921-208-336-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CCGbank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial
Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies
which approximate the underlying predicate-argument structure. *Data* CCGbank contains
99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies
and errors in the original annotation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Cross-language information retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic abstracting
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hockenmaier, Julia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Steedman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633534
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 292-607-460-859-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Gigaword Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Gigaword Second Edition was produced by Linguistic Data Consortium
(LDC), catalog number LDC2005T14 and ISBN 1-58563-353-4. This is a comprehensive archive
of newswire text data in Chinese that has been acquired over several years by the
LDC. This edition includes all of the contents in the first release of the Chinese
Gigaword corpus (LDC2003T09) as well as new data collected after the publication of
the first edition. Also, a limited number of articles from a new newspaper source
(Zaobao) have been added in this edition. The three distinct international sources
of Chinese newswire included in this edition are the following: Central News Agency,
Taiwan (cna_cmn) Xinhua News Agency (xin_cmn) Zaobao Newspaper (zbn_cmn) The seven-character
abbreviations shown above represent both the source name and the language ID ("cmn"
for Mandarin Chinese). *New In Second Edition* New documents (Xinhua from October
2002 through December 2004 and CNA from January 2003 through December 2004) have been
added. A new newspaper source (Lianhe Zaobao) has been added.
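The seven-character source abbreviations described in the summary can be split into their two parts. The following is a minimal sketch, assuming the underscore separates a three-letter source code from a three-letter language ID; the function name `split_source_id` and the `SOURCES` table are hypothetical illustrations and not part of the corpus or its tooling.

```python
# Source abbreviations as listed in the summary of this record.
SOURCES = {
    "cna_cmn": "Central News Agency, Taiwan",
    "xin_cmn": "Xinhua News Agency",
    "zbn_cmn": "Zaobao Newspaper",
}


def split_source_id(source_id: str) -> tuple[str, str]:
    """Split an abbreviation such as 'cna_cmn' into (source, language ID).

    Assumes the layout <3-letter source>_<3-letter language ID>,
    which gives the seven characters mentioned in the summary.
    """
    source, sep, lang = source_id.partition("_")
    if len(source_id) != 7 or sep != "_" or len(source) != 3 or len(lang) != 3:
        raise ValueError(f"unexpected source abbreviation: {source_id!r}")
    return source, lang


if __name__ == "__main__":
    for abbreviation, name in SOURCES.items():
        source, lang = split_source_id(abbreviation)
        print(f"{name}: source={source}, language={lang}")
```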
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636649
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 526-115-548-399-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 was developed
by the Linguistic Data Consortium (LDC) and contains 179,842 tokens of word-aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word-aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags is designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
*Data* This release consists of Chinese source broadcast conversation (BC) and broadcast
news (BN) programming collected by LDC in 2005-2007. The distribution by genre, words,
character tokens and segments appears below:
Language  Genre  Docs  Words   CharTokens  Segments
Chinese   BC     12    51,192  76,789      2,943
Chinese   BN     16    68,702  103,053     3,539
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters.
The Chinese word alignment tasks consisted of the following components:
* Identifying, aligning, and tagging 8 different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were part of a semantic link.
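The counts reported in the summary can be cross-checked against the stated conversion of 1.5 characters per word. The following is a minimal sketch, assuming the word figures were derived by dividing the character-token counts by 1.5 and discarding any fraction; that truncation rule is an assumption that happens to reproduce the reported figures, not something the record states. The names `COUNTS` and `words_from_chars` are hypothetical.

```python
# Figures copied from the genre table in the summary (Chinese data only).
COUNTS = {
    "BC": {"docs": 12, "words": 51192, "char_tokens": 76789, "segments": 2943},
    "BN": {"docs": 16, "words": 68702, "char_tokens": 103053, "segments": 3539},
}


def words_from_chars(char_tokens: int) -> int:
    """Estimate words at 1.5 characters per word, using integer arithmetic.

    Dividing by 1.5 is the same as multiplying by 2 and dividing by 3;
    floor division drops the fractional word (an assumed rounding rule).
    """
    return char_tokens * 2 // 3


# The per-genre character tokens add up to the token total quoted in the summary.
total_char_tokens = sum(row["char_tokens"] for row in COUNTS.values())

if __name__ == "__main__":
    print("total character tokens:", total_char_tokens)
    for genre, row in COUNTS.items():
        estimate = words_from_chars(row["char_tokens"])
        print(genre, "reported words:", row["words"], "estimated:", estimate)
```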
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633399
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 114-628-220-295-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT4 Multilingual Text and Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TDT4 Multilingual Text and Annotations was developed by the Linguistic Data Consortium
(LDC) with support from the DARPA TIDES (Translingual Information Detection, Extraction
and Summarization) Program. This release contains the complete set of English, Arabic
and Chinese news text (broadcast news transcripts and newswire data) used in the 2002
and 2003 Topic Detection and Tracking technology evaluations, along with topic annotations
created for those evaluations. The audio corresponding to the broadcast news transcripts
in this release can be found in TDT4 Multilingual Broadcast News Speech Corpus (LDC2005S11).
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically
related material in streams of data such as newswire and broadcast news. Evaluation
tasks in 2002 and 2003 included the segmentation of a news source into stories, the
tracking of known topics, the detection of unknown topics, the detection of initial
stories on unknown topics, and the detection of pairs of stories on the same topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633364
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 972-386-127-770-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher English Training Part 2, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher English Training Part 2 Transcripts represents the second half of a collection
of conversational telephone speech (CTS) that was created at the LDC during 2003.
It consists of time-aligned transcripts for the speech contained in Fisher English
Training Part 2, Speech (LDC2005S13). The Fisher telephone conversation collection
protocol was created at the LDC to address a critical need of developers trying to
build robust automatic speech recognition (ASR) systems. Previous collection protocols,
such as CALLFRIEND and Switchboard-II and the resulting corpora, have been adapted
for ASR research but were in fact developed for language and speaker identification
respectively. Although the CALLHOME protocol and corpora were developed to support
ASR technology, they feature small numbers of speakers making telephone calls of relatively
long duration with narrow vocabulary across the collection. CALLHOME conversations
are challengingly natural and intimate. Under the Fisher protocol, a large number
of participants each call another participant, whom they typically do not know,
for a short period of time to discuss the assigned topics. This maximizes inter-speaker
variation and vocabulary breadth while also increasing formality. Previous protocols
such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive
the collection. Fisher is unique in being platform-driven rather than participant-driven.
Participants who wish to initiate a call may do so; however, the collection
platform initiates the majority of calls. Participants need only answer their phones
at the times they specified when registering for the study. To encourage a broad range
of vocabulary, Fisher participants are asked to speak on an assigned topic which is
selected at random from a list, which changes every 24 hours and which is assigned
to all subjects paired on that day. Some topics are inherited or refined from previous
Switchboard studies while others were developed specifically for the Fisher protocol.
*Data* The first half of the collection (Fisher English Training Speech, Part 1) was
released by the LDC in 2004 (LDC2004S13 for speech data, LDC2004T19 for transcripts).
Taken as a whole, the two parts comprise 11,699 recorded telephone conversations. The
individual audio files are presented in NIST SPHERE format and contain two-channel
mu-law sample data; shorten compression has been applied to all files. Data collection
and transcription were sponsored by DARPA and the U.S. Department of Defense, as part
of the EARS project for research and development in automatic speech recognition.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kimball, Owen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630195
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S1
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 664-033-662-630-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TIMIT Acoustic-Phonetic Continuous Speech Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S1
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic
studies and for the development and evaluation of automatic speech recognition systems.
TIMIT contains broadband recordings of 630 speakers of eight major dialects of American
English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned
orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform
file for each utterance. Corpus design was a joint effort among the Massachusetts
Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc.
(TI). The speech was recorded at TI, transcribed at MIT and verified and prepared
for CD-ROM production by the National Institute of Standards and Technology (NIST).
The TIMIT corpus transcriptions have been hand verified. Test and training subsets,
balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable
information is included as well as written documentation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lamel, Lori F.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zue, Victor
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S1
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630187
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 177-353-807-744-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains speech that was originally designed and collected at Texas Instruments,
Inc. (TI) for the purpose of designing and evaluating algorithms for speaker-independent
recognition of connected digit sequences. There are 326 speakers (111 men, 114 women,
50 boys and 51 girls), each pronouncing 77 digit sequences. Each speaker group is partitioned
into test and training subsets. The corpus was collected at TI in 1982 in a quiet
acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone, digitized
at 20 kHz. The waveform files are in the NIST SPHERE format. *Updates* As of April 2015,
TIDIGITS is also available in FLAC-compressed WAV. This package is available to licensees
as an additional download. Not included in this version are the folders relating to
handling the shortened SPHERE files of the original corpus.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Leonard, R. Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630144
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 520-913-092-152-0
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Road Rally corpus was designed for the development and testing of word-spotting
systems and was collected in a conversational domain using a road rally planning task
as the topic. The corpus consists of two sub-corpora: "Stonehenge" and "Waterloo."
The Stonehenge corpus contains road rally planning conversations as well as some read
speech collected using high-quality microphones and a telephone-simulating filter.
The Waterloo corpus contains read road rally planning domain speech which was collected
using actual telephone lines. *Stonehenge* The Stonehenge corpus was collected from
subjects using telephone handsets which were modified to contain a high-quality microphone.
To gather conversational data, two talkers were located in separate rooms, given a
road map and asked to participate in a road rally planning task. Their objective was
to form a path between two locations on the map which would maximize their road rally
point score. They were also given a time limit in which to complete the task to increase
their responsiveness. Their speech was recorded on a stereo tape recorder with each
subject's speech on a separate track. The tracks were digitized and the speech was
edited to remove silences longer than a second or so. This resulted in approximately
three minutes of continuous speech per subject. The speech was filtered using a 300 Hz
to 3300 Hz PCM FIR bandpass filter to simulate telephone bandwidth quality. The Stonehenge
corpus consists of 80 speakers (28 females and 52 males). *Waterloo* The Waterloo corpus
was collected as an extension to Stonehenge to provide similar domain speech under
different conditions. The corpus was collected from subjects using conventional telephones
and dialed-up telephone lines in the Massachusetts area. Unlike the Stonehenge speech,
the Waterloo speech is naturally band-limited by the telephones and lines, but for consistency
the speech was also filtered using the Stonehenge 300 Hz to 3300 Hz PCM FIR bandpass
filter. The corpus consists of 56 speakers (28 males and 28 females), each reading
aloud a paragraph of road rally domain speech.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630098
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 777-455-577-608-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HCRC Map Task Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Originally published as a set of eight CD-ROMs, the Map Task Corpus is now delivered
as a web download. The contents of each disc reside in separate directories with the
same structure as the original set. The Map Task Corpus contains a total of about
18 hours of spontaneous speech that was recorded from 128 two-person conversations,
involving 64 different speakers (32 female, 32 male, all adults, each taking part
in four conversations). The 64 speakers were all students at the University of Glasgow,
61 of them being native Scots. The conversations were carried out in an experimental
setting, in which each participant has a schematic map in front of them, not visible
to the other. Each map is comprised of an outline and roughly a dozen labelled features
(e.g. a white cottage, an oak forest, Green Bay, etc). Most features are common to
the two maps, but not all. One map has a route drawn in, the other does not. The task
is for the participant without the route to draw one on the basis of discussion with
the participant with the route. In addition to the conversations, each speaker provides
a wordlist reading, consisting of the major vocabulary items contained in the conversations.
The experimental design allows a number of different phonemic, syntactico-semantic
and pragmatic contrasts to be explored in a controlled way. In particular, maps and
feature names were designed to allow for controlled exploration of phonological reductions
of various kinds in a number of different referential contexts and to provide, via
varying patterns of matches and mismatches between the two maps, a range of different
stimuli for referent negotiation. The conditions of the conversations were also carefully
balanced: in half of them the talkers were strangers, in half friends; in half of them
the talkers could see each other's faces, in half they could not. The waveform data
are provided in raw (headerless) files (16-bit samples, 20 kHz sample rate, two channels
per conversation) and alternative header files are provided for use with software
based on either the NIST SPHERE header structure or the European SAM header structure.
Text transcriptions are provided for each conversation, along with PostScript files
of the map images used in the experiments. Additional materials include full documentation
of the experimental design and data collection protocol, resources for using SGML
tools on the transcriptions and other text materials and an extensive set of source
code for performing basic signal processing functions on the waveform data, such as
down-sampling, de-multiplexing, channel summation and D/A conversion for Sun workstations
(including playback of segments selected via inspection of transcripts in Emacs).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630101
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S2
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 032-224-820-254-0
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S2
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NTIMIT was developed by the NYNEX Science and Technology Speech Communication Group
to provide a telephone bandwidth adjunct to TIMIT (LDC93S1). NTIMIT was collected
by transmitting all 6,300 original TIMIT recordings through a telephone handset and
over various channels in the NYNEX telephone network and redigitizing them. The recordings
were transmitted through ten Local Access and Transport Areas, half of which required
the use of long-distance carriers. In order to calibrate the transmission characteristics
of the various channels, stationary 1 kHz and frequency-sweeping tones were also recorded
for each of the transmission channels. The re-recorded waveforms were time-aligned
with the original TIMIT waveforms so that the TIMIT time-aligned transcriptions can
be used with NTIMIT as well. In addition to the documentation included with this release,
see Jankowski et al., "NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone
Bandwidth Speech Database," Proc. ICASSP-90, April 1990. NYNEX retains full copyright
on the corpus and all associated materials.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goudie-Marshall, Kathleen M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jankowski, Charles
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kalyanswamy, Ashok
ADDED ENTRY--PERSONAL NAME
- Personal name:
Basson, Sara
ADDED ENTRY--PERSONAL NAME
- Personal name:
Spitz, Judith
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S2
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632201
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S3A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 257-512-523-174-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Resource Management Complete Set 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S3A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S3A - Resource Management Complete Set 2.0; LDC93S3B - Resource Management (RM1)
2.0; LDC93S3C - Resource Management (RM2) 2.0. The DARPA Resource Management corpora
(RM) consist of digitized and transcribed speech for use in designing and evaluating
continuous speech recognition systems. There are two main parts, often referred to
as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data,
Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional
and larger SD data set, including test material. Resource Management Complete Set
2.0 contains RM1 and RM2. All RM material consists of read sentences modeled after
a naval resource management task. The complete corpus contains over 25,000 utterances
from more than 160 speakers representing a variety of American dialects. The material
was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset
microphone. All discs conform to the ISO-9660 data format. Resource Management SD and
SI Training and Test Data (RM1): The Speaker-Dependent (SD) Training Data contains
12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences
and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances.
The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two
"dialect" sentences plus 40 sentences from the Resource Management text corpus, for
a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600
Resource Management sentence texts was recorded by two subjects, while no sentence
was read twice by the same subject. RM1 contains all SD and SI system test material
used in five DARPA benchmark tests conducted in March and October of 1987, June 1988,
and February and October 1989, along with scoring and diagnostic software and documentation
for those tests. Documentation is also provided outlining use of the Resource Management
training and test material at CMU in development of the SPHINX system. Example output
and scored results for state-of-the-art speaker-dependent and speaker-independent
systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark
tests are included. Extended Resource Management Speaker-Dependent Corpus (RM2): This
set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The
corpus consists of a total of 10,508 sentence utterances (two male and two female
speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource
Management speaker-dependent training sentences, two dialect calibration sentences,
ten rapid adaptation sentences, 1,800 newly-generated extended training sentences,
120 newly-generated development-test sentences and 120 newly-generated evaluation-test
sentences. The evaluation-test material on this disc was used as the test set for
the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring
software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences
and is included in this publication.
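The utterance totals quoted in this summary follow from the per-speaker counts. The short Python sketch below recomputes them; the variable names are illustrative and not part of the corpus documentation. The RM1 SD and SI training totals reproduce the stated 7,344 and 3,360. For RM2 the per-speaker count reproduces the stated 2,652, but four speakers times 2,652 gives 10,608, which is 100 more than the 10,508 stated above; the record does not explain the difference.

```python
# Per-speaker counts as quoted in the summary (names are illustrative only).
SD_SPEAKERS = 12
SD_PER_SPEAKER = 600 + 2 + 10            # training + dialect + rapid adaptation

SI_SPEAKERS = 80
SI_PER_SPEAKER = 2 + 40                  # dialect + Resource Management sentences

RM2_SPEAKERS = 2 + 2                     # two male, two female
RM2_PER_SPEAKER = 600 + 2 + 10 + 1800 + 120 + 120

sd_total = SD_SPEAKERS * SD_PER_SPEAKER
si_total = SI_SPEAKERS * SI_PER_SPEAKER
rm2_total = RM2_SPEAKERS * RM2_PER_SPEAKER

print("RM1 SD training utterances:", sd_total)
print("RM1 SI training utterances:", si_total)
print("RM2 sentence texts per speaker:", RM2_PER_SPEAKER)
print("RM2 utterances (speakers x texts):", rm2_total)
```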
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, W.M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, D.S.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S3A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S3B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 925-711-891-661-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Resource Management RM1 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S3B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S3A - Resource Management Complete Set 2.0; LDC93S3B - Resource Management (RM1)
2.0; LDC93S3C - Resource Management (RM2) 2.0. The DARPA Resource Management corpora
(RM) consist of digitized and transcribed speech for use in designing and evaluating
continuous speech recognition systems. There are two main parts, often referred to
as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data,
Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional
and larger SD data set, including test material. Resource Management Complete Set
2.0 contains RM1 and RM2. All RM material consists of read sentences modeled after
a naval resource management task. The complete corpus contains over 25,000 utterances
from more than 160 speakers representing a variety of American dialects. The material
was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset
microphone. Resource Management SD and SI Training and Test Data (RM1): The Speaker-Dependent
(SD) Training Data contains 12 subjects, each reading a set of 600 "training sentences,"
two "dialect" sentences and ten "rapid adaptation" sentences, for a total of 7,344
recorded sentence utterances. The 600 sentences designated as training cover 97% of
the lexical items in the corpus. The Speaker-Independent (SI) Training Data contains
80 speakers, each reading two "dialect" sentences plus 40 sentences from the Resource
Management text corpus, for a total of 3,360 recorded sentence utterances. Any given
sentence from a set of 1,600 Resource Management sentence texts was recorded by two
subjects, while no sentence was read twice by the same subject. RM1 contains all SD
and SI system test material used in five DARPA benchmark tests conducted in March and
October of 1987, June 1988 and February and October 1989, along with scoring and diagnostic
software and documentation for those tests. Documentation is also provided outlining
use of the Resource Management training and test material at CMU in development of
the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent
and speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the
October 1989 benchmark tests are included. Extended Resource Management Speaker-Dependent
Corpus (RM2): This set forms a speaker-dependent extension to the Resource Management
(RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (two male
and two female speakers each speaking 2,652 sentence texts). These include the 600
"standard" Resource Management speaker-dependent training sentences, two dialect calibration
sentences, ten rapid adaptation sentences, 1,800 newly-generated extended training
sentences, 120 newly-generated development-test sentences and 120 newly-generated
evaluation-test sentences. The evaluation-test material was used as the test set for
the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings).
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring
software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences
and is included in this publication.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, W.M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, D.S.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S3B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630136
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S3C
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 927-789-877-742-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Resource Management RM2 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S3C
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S3A - Resource Management Complete Set 2.0; LDC93S3B - Resource Management (RM1)
2.0; LDC93S3C - Resource Management (RM2) 2.0. The DARPA Resource Management corpora
(RM) consist of digitized and transcribed speech for use in designing and evaluating
continuous speech recognition systems. There are two main parts, often referred to
as RM1 and RM2. RM1 contains three sections, Speaker-Dependent (SD) training data,
Speaker-Independent (SI) training data and test and evaluation data. RM2 has an additional
and larger SD data set, including test material. Resource Management Complete Set
2.0 contains RM1 and RM2. All RM material consists of read sentences modeled after
a naval resource management task. The complete corpus contains over 25,000 utterances
from more than 160 speakers representing a variety of American dialects. The material
was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset
microphone. All discs conform to the ISO-9660 data format. Resource Management SD and
SI Training and Test Data (RM1): The Speaker-Dependent (SD) Training Data contains
12 subjects, each reading a set of 600 "training sentences," two "dialect" sentences
and ten "rapid adaptation" sentences, for a total of 7,344 recorded sentence utterances.
The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The Speaker-Independent (SI) Training Data contains 80 speakers, each reading two
"dialect" sentences plus 40 sentences from the Resource Management text corpus, for
a total of 3,360 recorded sentence utterances. Any given sentence from a set of 1,600
Resource Management sentence texts was recorded by two subjects, while no sentence
was read twice by the same subject. RM1 contains all SD and SI system test material
used in five DARPA benchmark tests conducted in March and October of 1987, June 1988
and February and October 1989, along with scoring and diagnostic software and documentation
for those tests. Documentation is also provided outlining use of the Resource Management
training and test material at CMU in development of the SPHINX system. Example output
and scored results for state-of-the-art speaker-dependent and speaker-independent
systems (i.e. the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark
tests are included. Extended Resource Management Speaker-Dependent Corpus (RM2): This
set forms a speaker-dependent extension to the RM1 corpus. The corpus consists of
a total of 10,508 sentence utterances (two male and two female speakers each speaking
2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent
training sentences, two dialect calibration sentences, ten rapid adaptation sentences,
1,800 newly-generated extended training sentences, 120 newly-generated development-test
sentences and 120 newly-generated evaluation-test sentences. The evaluation-test material
on this disc was used as the test set for the June 1990 DARPA SLS Resource Management
Benchmark Tests (see the Proceedings). The RM2 corpus was recorded at Texas Instruments.
The NIST speech recognition scoring software originally distributed on the RM1 "Test"
Disc was adapted for RM2 sentences and is included in this publication.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, W.M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, D.S.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S3C
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630012
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S4A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 101-041-175-695-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ATIS0 Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S4A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The ATIS0 Corpus comprises spontaneous data from 36 speakers; read versions
of the data from 20 of those speakers, along with some adaptation material; and extensive
speaker-dependent material from the ATIS domain, read by ten of the same speakers.
LDC also released: LDC93S4B - ATIS0 Pilot, LDC93S4B-2 - ATIS0 Read, and LDC93S4B-3
- ATIS0 SD-Read. *Data* All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit
quantization, from two different microphones, a close-talking (Sennheiser HMD414)
and a desk-top (Crown PCC-160) model. ATIS0 Pilot contains spontaneous utterances
elicited in a "Wizard-of-Oz" simulation, along with the relational database containing
the travel information (excluding connecting flights). Thirty-six speakers produced a total
of 912 utterances. ATIS0 Read contains "read" versions of the spontaneous utterances
for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented
by a set of 40 "adaptation" sentences read by each of the 20 speakers. ATIS0 SD-Read
contains "read" speech in the ATIS domain for ten of the speakers on ATIS0 Pilot.
They read a total of 3,171 utterances, or approximately 317 utterances per speaker.
This data was collected for the purpose of training speaker-dependent speech recognition
systems for the ATIS0 domain. This section also contains the close-talking (Sennheiser)
microphone data and corresponding data for the desk-top (Crown PCC-160) microphone.
Thus there are 6,342 waveform files in this section.
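The waveform-file count for the SD-Read section is simple arithmetic on the figures above. The Python sketch below reproduces it, assuming (as the summary implies) that every utterance was captured once per microphone; the variable names are illustrative and not taken from the corpus documentation.

```python
# Figures quoted in the summary for ATIS0 SD-Read (names are illustrative only).
SD_READ_UTTERANCES = 3171
SD_READ_SPEAKERS = 10
MICROPHONES = 2          # close-talking Sennheiser and desk-top Crown

# Assumption: one waveform file per utterance per microphone.
waveform_files = SD_READ_UTTERANCES * MICROPHONES
mean_per_speaker = SD_READ_UTTERANCES / SD_READ_SPEAKERS

print("SD-Read waveform files:", waveform_files)
print("Mean utterances per speaker:", round(mean_per_speaker))
```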
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hemphill, Charles T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S4A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630020
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S4B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 477-521-980-972-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ATIS0 Pilot
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S4B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S4A - Complete ATIS0 corpus; LDC93S4B - ATIS0 Pilot; LDC93S4B-2 - ATIS0 Read; LDC93S4B-3
- ATIS0 SD-Read. The ATIS0 Corpus comprises six parts: one with spontaneous data
from 36 speakers; one with read versions of the data from 20 of those speakers, along
with some adaptation material; and four with extensive speaker-dependent material
from the ATIS domain, read by ten of the same speakers. All ATIS speech data is recorded
at a 16 kHz sample rate with 16-bit quantization, from two different microphones, a close-talking
(Sennheiser HMD414) and a desk-top (Crown PCC-160) model. The first disc (ATIS0 Pilot)
contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with
the relational database containing the travel information (excluding connecting flights).
Thirty-six speakers produced a total of 912 utterances. The second disc (ATIS0 Read) contains
"read" versions of the spontaneous utterances for 20 of the 36 speakers above, for
a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences
read by each of the 20 speakers. The third through the sixth discs (ATIS0 SD-Read)
contain "read" speech in the ATIS domain for ten of the speakers on the first disc.
They read a total of 3,171 utterances, or approximately 317 utterances per speaker.
This data was collected for the purpose of training speaker-dependent speech recognition
systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser)
microphone data and the other two contain corresponding data for the desk-top (Crown
PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hemphill, Charles T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tjaden, Brett
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S4B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630039
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S4B-2
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 470-709-845-333-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ATIS0 Read
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S4B-2
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S4A - Complete ATIS0 corpus; LDC93S4B - ATIS0 Pilot; LDC93S4B-2 - ATIS0 Read; LDC93S4B-3
- ATIS0 SD-Read. The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from
36 speakers; one with read versions of the data from 20 of those speakers, along with
some adaptation material; and four with extensive speaker-dependent material from
the ATIS domain, read by ten of the same speakers. All ATIS speech data is recorded
at a 16 kHz sample rate with 16-bit quantization, from two different microphones, a close-talking
(Sennheiser HMD414) and a desk-top (Crown PCC-160) model. The first disc (ATIS0 Pilot)
contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with
the relational database containing the travel information (excluding connecting flights).
Thirty-six speakers produced a total of 912 utterances. The second disc (ATIS0 Read) contains
"read" versions of the spontaneous utterances for 20 of the 36 speakers above, for
a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences
read by each of the 20 speakers. The third through the sixth discs (ATIS0 SD-Read)
contain "read" speech in the ATIS domain for ten of the speakers on the first disc.
They read a total of 3,171 utterances, or approximately 317 utterances per speaker.
This data was collected for the purpose of training speaker-dependent speech recognition
systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser)
microphone data and the other two contain corresponding data for the desk-top (Crown
PCC-160) microphone. Thus there are 6,342 waveform files on the four discs.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hemphill, Charles T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tjaden, Brett
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S4B-2
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630047
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S4B-3
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 772-073-881-651-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ATIS0 SD-Read
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S4B-3
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S4A - Complete ATIS0 corpus
LDC93S4B - ATIS0 Pilot
LDC93S4B-2 - ATIS0 Read
LDC93S4B-3 - ATIS0 SD-Read
The ATIS0 Corpus totals six CD-ROMs: one with spontaneous data from
36 speakers; one with read versions of the data from 20 of those speakers, along with
some adaptation material; and four with extensive speaker dependent material from
the ATIS domain, read by ten of the same speakers. All ATIS speech data is recorded
at 16kHz sample rate, 16-bit quantization, from two different microphones, a close-talking
(Sennheiser HMD414) and a desk-top (Crown PCC-160) model. The first disc (ATIS0 Pilot)
contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with
the relational database containing the travel information (excluding connecting flights).
Thirty-six speakers produced a total of 912 utterances. The second disc (ATIS0 Read)
contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above,
for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences
read by each of the 20 speakers. The third through the sixth discs (ATIS0 SD-Read)
contain "read" speech in the ATIS domain for ten of the speakers on the first disc.
They read a total of 3,171 utterances, or approximately 317 utterances per speaker.
This data was collected for the purpose of training speaker-dependent speech recognition
systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser)
microphone data and the other two contain corresponding data for the desk-top (Crown
PCC-160) microphone. Thus there are 6,342 waveform files on the four discs. *Update*
This publication has been condensed from 4 CDROM discs to a single DVDROM. The contents
of each CD reside in separate directories that are organized identically to the original
version.
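The counts quoted in this summary can be cross-checked with a few lines of Python. This is only a sketch: it assumes, as the summary implies, that every SD-Read utterance is stored as one waveform file per microphone, and the variable names are illustrative.

```python
# Figures quoted in the summary above.
utterances_read = 3171   # total SD-Read utterances
speakers = 10
microphones = 2          # close-talking Sennheiser and desk-top Crown

# Average number of utterances read by each speaker.
per_speaker = utterances_read / speakers

# Assumption: one waveform file per utterance per microphone.
waveform_files = utterances_read * microphones

# Uncompressed data rate of one channel at 16 kHz with 16-bit samples.
bytes_per_second = 16_000 * 16 // 8

print(per_speaker)       # 317.1
print(waveform_files)    # 6342
print(bytes_per_second)  # 32000
```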
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hemphill, Charles T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tjaden, Brett
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S4B-3
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630055
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S5
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 387-394-427-128-0
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S5
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The ATIS2 corpus contains approximately 15,000 utterances recorded from approximately
450 subjects at five sites: AT&T, BBN, CMU, MIT's Laboratory for Computer Science and
SRI. All utterances have been transcribed and almost 10,000 of them annotated with
categorizations and canonical reference answers. Unlike the ATIS0 corpus, much of
the data in ATIS2 was collected using partially or fully-automated data collection
systems. The fully-automated data collection systems were, in fact, working ATIS prototypes.
For ATIS2, the ten-city relational database of ATIS0 was revised to accommodate connecting
flights and fares and some table headings were renamed. In addition to training data,
the February and November '92 ATIS Benchmark Tests are included as well. Each contains
approximately 1,000 utterances from the pool of data collected by the five sites.
*Update* This publication has been condensed from four CDROM discs to
a single web download.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hunicke-Smith, Kate
ADDED ENTRY--PERSONAL NAME
- Personal name:
Danielson, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bocchieri, Enrico
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buntschuh, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schwartz, Beverly
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peters, Sandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ingria, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weide, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Yuzong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thayer, Eric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hirschman, Lynette
ADDED ENTRY--PERSONAL NAME
- Personal name:
Polifroni, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lund, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kawai, Goh
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Norton, Lew
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahl, Deborah
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bates, Madeleine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brown, Michael
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rudnicky, Alexander
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S5
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630063
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S6A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 296-840-353-630-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-I (WSJ0) Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S6A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S6A - Complete CSR-I corpus
LDC93S6B - CSR-I Sennheiser speech
LDC93S6C - CSR-I other speech
During 1991, the DARPA Spoken Language Program initiated efforts to build
a new corpus to support research on large-vocabulary Continuous Speech Recognition
(CSR) systems. The first two CSR Corpora consist primarily of read speech with texts
drawn from a machine-readable corpus of Wall Street Journal news text and are thus
often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however,
will consist of read texts from other sources of North American business news and
eventually from other news domains). The texts to be read were selected to fall within
either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation
for details). Some spontaneous dictation is included in addition to the read speech.
The dictation portion was collected using journalists who dictated hypothetical news
articles. Two microphones are used throughout: a close-talking Sennheiser HMD414 and
a secondary microphone, which may vary. The corpora are thus offered in three configurations:
the speech from the Sennheiser, the speech from the other microphone and the speech
from both; all three sets include all transcriptions, tests, documentation, etc. In
general, transcriptions of the speech, test data from ARPA evaluations, scores achieved
by various speech recognition systems and software used in scoring are included on
separate discs from the waveform data.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paul, Doug
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S6A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630071
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S6B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 393-204-041-392-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-I (WSJ0) Sennheiser
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S6B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S6A - Complete CSR-I corpus
LDC93S6B - CSR-I Sennheiser speech
LDC93S6C - CSR-I other speech
During 1991, the DARPA Spoken Language Program initiated efforts to build
a new corpus to support research on large-vocabulary Continuous Speech Recognition
(CSR) systems. The first two CSR Corpora consist primarily of read speech with texts
drawn from a machine-readable corpus of Wall Street Journal news text and are thus
often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however,
will consist of read texts from other sources of North American business news and
eventually from other news domains). The texts to be read were selected to fall within
either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation
for details). Some spontaneous dictation is included in addition to the read speech.
The dictation portion was collected using journalists who dictated hypothetical news
articles. This amounts to approximately 70 hours of speech. Two microphones are used
throughout: a close-talking Sennheiser HMD414 and a secondary microphone, which may
vary. The corpora are thus offered in three configurations: the speech from the Sennheiser,
the speech from the other microphone and the speech from both; all three sets include
all transcriptions, tests, documentation, etc. In general, transcriptions of the speech,
test data from ARPA evaluations, scores achieved by various speech recognition systems
and software used in scoring are included on separate discs from the waveform data.
Please note that this corpus has been updated from its original disc release to a web
download; some of the documentation may still reflect its original disc state. However,
all data is still present.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paul, Doug
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S6B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S6C
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 828-375-010-195-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-I (WSJ0) Other
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S6C
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93S6A - Complete CSR-I corpus
LDC93S6B - CSR-I Sennheiser speech
LDC93S6C - CSR-I other speech
During 1991, the DARPA Spoken Language Program initiated efforts to build
a new corpus to support research on large-vocabulary Continuous Speech Recognition
(CSR) systems. The first two CSR Corpora consist primarily of read speech with texts
drawn from a machine-readable corpus of Wall Street Journal news text and are thus
often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however,
will consist of read texts from other sources of North American business news and
eventually from other news domains). The texts to be read were selected to fall within
either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation
for details). Some spontaneous dictation is included in addition to the read speech.
The dictation portion was collected using journalists who dictated hypothetical news
articles. Two microphones are used throughout: a close-talking Sennheiser HMD414 and
a secondary microphone, which may vary. The corpora are thus offered in three configurations:
the speech from the Sennheiser, the speech from the other microphone and the speech
from both; all three sets include all transcriptions, tests, documentation, etc. In
general, transcriptions of the speech, test data from ARPA evaluations, scores achieved
by various speech recognition systems and software used in scoring are included on
separate discs from the waveform data.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paul, Doug
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S6C
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630160
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S8
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 427-743-343-017-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard Credit Card
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S8
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains 35 conversations on the topic of "Credit Card Use." The conversations
can be used in training and testing wordspotting systems. In addition to two-channel
mu-law encoded audio waveform files, the disc contains transcriptions, time-alignments
and wordspotting targets.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Holliman, Ed
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S8
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630179
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S9
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 476-195-137-873-5
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S9
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains a corpus of speech which was originally designed and collected
at Texas Instruments, Inc. (TI) in 1980 and used initially in performance assessment
tests of isolated-word speaker-dependent technology. (See "Speech Recognition: Turning
Theory to Practice" by G. R. Doddington and T. B. Schalk, in IEEE Spectrum, Vol. 18,
No. 9, September 1981.) The 46-word vocabulary consists of two sub-vocabularies: (1)
the TI 20-word vocabulary (consisting of the digits zero through nine plus the words
"enter," "erase," "go," "help," "no," "rubout," "repeat," "stop," "start," and "yes")
as well as (2) the TI 26-word "alphabet set" (consisting of the letters "a" through
"z"). *Data* The corpus contains read utterances from 16 speakers (eight males and
eight females) each speaking 26 utterances of the 46-word vocabulary: 16 tokens designated
as training and ten as test. Note these numbers reflect the aim of the collection
and for various reasons, the full number of utterances was not reached for some speakers.
See the included documentation for more information. The corpus was collected at Texas
Instruments in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid
microphone at 12.5kHz sample rate with 12-bit quantization. The files are in NIST
SPHERE format and have a ".wav" filename extension.
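Because these ".wav" files are NIST SPHERE rather than RIFF/WAVE, generic WAV readers may reject them. The sketch below shows one way to read the plain-text SPHERE header in Python. It assumes the usual layout (an "NIST_1A" line, a line giving the header size, then "name -type value" lines ending with "end_head"); the function name and the demo header are hypothetical, and the fields actually present in this corpus should be checked against the included documentation.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Return the header fields found at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line; not a SPHERE file")
    # The second 8-byte line holds the total header size in bytes.
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for raw in text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN": string of N characters
    return fields


# Hypothetical in-memory header using the sampling figures quoted above.
demo = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 12500\n"
    b"sample_sig_bits -i 12\n"
    b"channel_count -i 1\n"
    b"end_head\n"
).ljust(1024) + b"\x00\x00"

header = parse_sphere_header(demo)
print(header["sample_rate"], header["sample_sig_bits"])  # 12500 12
```

For a real file, the same function can be given the first bytes read from disk in binary mode; the audio samples start at the offset given by the header size.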
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Amsler, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Church, Ken
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hafner, Carole
ADDED ENTRY--PERSONAL NAME
- Personal name:
Klavans, Judy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitch
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mercer, Bob
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pedersen, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Roossin, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Don
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warwick, Susan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zampolli, Antonio
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S9
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630004
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93T1
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 663-248-563-590-7
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93T1
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACL Data Collection Initiative contains text from the Wall Street Journal, the Collins
English Dictionary, scientific abstracts provided by the U.S. Department of Energy
and a variety of grammatically tagged and parsed materials from the Treebank project
at the University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.
The many formats of the original texts have been mapped into a markup language consistent
with the SGML standard (ISO 8879). The format of the material from the Wall Street
Journal uses a labelled bracketing, expressed in the style of SGML, although no formal
SGML DTD is provided. The tag set has been modified by turning the Dow Jones header
categories into tags and by creating ad hoc tags such as "". The original datelines
are presented as separate text units; the text is divided and tagged into paragraphs
and sentences with each sentence presented on a single line. Nothing has been done
to modify the typographical methods used to subdivide headlines and stories into sections,
nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.
The Collins English Dictionary is present in two forms. One form was approximately
parsed into fielded records as an exercise in learning a language called "FIT", by
a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during
the summer of 1990. The original digital image of the typographer's tape that the
database version was prepared from had serious flaws that were not detected and corrected
until later; the corrected version, a clean typographer's tape, is presented in a
separate directory. A properly-analyzed database version will be provided in the future.
The documentation includes notes developed during the new attempt to analyze the tape
from scratch. The Department of Energy abstracts reside in files that are approximately
one megabyte each. The original 950 separators have been replaced with newlines and
space padding between articles was removed. An acronym dictionary that was extracted
from the database as an indication of the material's topic areas has been included
in a separate directory. Provisional material from the Penn Treebank project is divided
into two subdirectories on this disk. The subdirectory "postext" contains text with
part-of-speech annotations; "parstext" contains text with syntactic bracketing.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93T1
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630209
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93T3A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 741-001-210-040-2
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93T3A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93T3A - Complete TIPSTER corpus
LDC93T3B - Volume 1 of the TIPSTER corpus
LDC93T3C - Volume 2 of the TIPSTER corpus
LDC93T3D - Volume 3 of the TIPSTER corpus
TIPSTER is sometimes also called the Text Research Collection Volume or TREC. The TIPSTER
project was sponsored by the Software and Intelligent Systems Technology Office of
the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance
the state of the art in effective document detection (information retrieval) and data
extraction from large, real-world data collections. The detection data is comprised
of a test collection built at NIST for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval research groups,
working on the same task as the TIPSTER groups, but meeting once a year in a workshop
to compare results (similar to MUC). The test collection consists of three CD-ROMs
of SGML encoded documents distributed by LDC plus queries and answers (relevant documents)
distributed by NIST.
Source (vol)                Year     Approx. # Words (Millions)
Associated Press (1)        1989     40
Associated Press (2)        1988     37
Associated Press (3)        1990     37
Wall Street Journal (1)     1987     20
Wall Street Journal (1)     1988     17
Wall Street Journal (1)     1989     6
Wall Street Journal (2)     1990     11
Wall Street Journal (2)     1991     22
Wall Street Journal (2)     1992     5
Dept. of Energy (1)                  28
Federal Register (1)        1989     38
Federal Register (2)        1988     30
Ziff/Davis (1)                       36
Ziff/Davis (2)              1989-90  26
Ziff/Davis (3)              1991-92  50
San Jose Mercury News (3)   1991     45
The documents in the test collection are varied in style,
size and subject domain. The first disk contains material from the Wall Street Journal
(1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information
from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department
of Energy. The second disk contains information from the same sources, but from different
years. The third disk contains more information from the Computer Select disks, plus
material from the San Jose Mercury News (1991), more AP newswire (1990) and about
250 megabytes of formatted U.S. Patents. The format of all the documents is relatively
clean and easy to use, with SGML-like tags separating documents and document fields.
There is no part-of-speech tagging or breakdown into individual sentences or paragraphs
as the purpose of this collection is to test retrieval against real-world data. The
three Tipster discs released have been re-issued with updates and corrections and
all recipients of the earlier versions should have received these replacements free
of charge. If you think you have the unrevised original, contact LDC for confirmation.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harman, Donna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93T3A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630217
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93T3B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 002-828-735-548-8
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93T3B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93T3A - Complete TIPSTER corpus
LDC93T3B - Volume 1 of the TIPSTER corpus
LDC93T3C - Volume 2 of the TIPSTER corpus
LDC93T3D - Volume 3 of the TIPSTER corpus
TIPSTER 1 is sometimes also called the Text Research Collection Volume 1 or TREC-1. The TIPSTER
project was sponsored by the Software and Intelligent Systems Technology Office of
the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance
the state of the art in effective document detection (information retrieval) and data
extraction from large, real-world data collections. The detection data is comprised
of a test collection built at NIST for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval research groups,
working on the same task as the TIPSTER groups, but meeting once a year in a workshop
to compare results (similar to MUC). The test collection consists of three CD-ROMs
of SGML encoded documents distributed by LDC plus queries and answers (relevant documents)
distributed by NIST.
Source                      Year     Approx. # Words (Millions)
Associated Press            1989     40
Wall Street Journal         1987     20
Wall Street Journal         1988     17
Wall Street Journal         1989     6
Dept. of Energy                      28
Federal Register            1989     38
Ziff/Davis                           36
The documents in the test
collection are varied in style, size and subject domain. The first disk contains material
from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the
Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing)
and short abstracts from the Department of Energy. The second disk contains information
from the same sources, but from different years. The third disk contains more information
from the Computer Select disks, plus material from the San Jose Mercury News (1991),
more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format
of all the documents is relatively clean and easy to use, with SGML-like tags separating
documents and document fields. There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection is to test retrieval
against real-world data. The three Tipster discs so far released have been re-issued
with updates and corrections and all recipients of the earlier versions should have
received these replacements free of charge. If you think you have the unrevised original,
contact LDC for confirmation.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harman, Donna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93T3B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630225
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93T3C
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 532-662-320-210-9
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93T3C
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93T3A - Complete TIPSTER corpus
LDC93T3B - Volume 1 of the TIPSTER corpus
LDC93T3C - Volume 2 of the TIPSTER corpus
LDC93T3D - Volume 3 of the TIPSTER corpus
TIPSTER 2 is sometimes also called the Text Research Collection Volume 2 or TREC-2. The TIPSTER
project was sponsored by the Software and Intelligent Systems Technology Office of
the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance
the state of the art in effective document detection (information retrieval) and data
extraction from large, real-world data collections. The detection data is comprised
of a test collection built at NIST for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval research groups,
working on the same task as the TIPSTER groups, but meeting once a year in a workshop
to compare results (similar to MUC). The test collection consists of three CD-ROMs
of SGML encoded documents distributed by LDC plus queries and answers (relevant documents)
distributed by NIST.
Source                      Year     Approx. # Words (Millions)
Associated Press            1988     37
Wall Street Journal         1990     11
Wall Street Journal         1991     22
Wall Street Journal         1992     5
Federal Register            1988     30
Ziff/Davis                  1989-90  26
The documents in the test collection
are varied in style, size and subject domain. The first disk contains material from
the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and
short abstracts from the Department of Energy. The second disk contains information
from the same sources, but from different years. The third disk contains more information
from the Computer Select disks, plus material from the San Jose Mercury News (1991),
more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format
of all the documents is relatively clean and easy to use, with SGML-like tags separating
documents and document fields. There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection is to test retrieval
against real-world data. The three Tipster discs so far released have been re-issued
with updates and corrections and all recipients of the earlier versions should have
received these replacements free of charge. If you think you have the unrevised original,
contact LDC for confirmation.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harman, Donna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93T3C
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630233
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93T3D
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 890-582-278-450-2
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93T3D
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC93T3A - Complete TIPSTER corpus
LDC93T3B - Volume 1 of the TIPSTER corpus
LDC93T3C - Volume 2 of the TIPSTER corpus
LDC93T3D - Volume 3 of the TIPSTER corpus
TIPSTER 3 is sometimes also called the Text Research Collection Volume 3 or TREC-3. The TIPSTER
project was sponsored by the Software and Intelligent Systems Technology Office of
the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance
the state of the art in effective document detection (information retrieval) and data
extraction from large, real-world data collections. The detection data is comprised
of a test collection built at NIST for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval research groups,
working on the same task as the TIPSTER groups, but meeting once a year in a workshop
to compare results (similar to MUC). The test collection consists of three CD-ROMs
of SGML encoded documents distributed by LDC plus queries and answers (relevant documents)
distributed by NIST.
Source                      Year     Approx. # Words (Millions)
Associated Press            1990     37
San Jose Mercury            1991     45
Ziff/Davis                  1991-92  50
The documents in the test collection
are varied in style, size and subject domain. The first disk contains material from
the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and
short abstracts from the Department of Energy. The second disk contains information
from the same sources, but from different years. The third disk contains more information
from the Computer Select disks, plus material from the San Jose Mercury News (1991),
more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format
of all the documents is relatively clean and easy to use, with SGML-like tags separating
documents and document fields. There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection is to test retrieval
against real-world data. The three Tipster discs so far released have been re-issued
with updates and corrections and all recipients of the earlier versions should have
received these replacements free of charge. If you think you have the unrevised original,
contact LDC for confirmation.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harman, Donna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93T3D
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630306
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S13A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 819-269-127-206-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-II (WSJ1) Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S13A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94S13A - Complete CSR-II corpus
LDC94S13B - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech
*Data* The complete WSJ1 corpus contains approximately 78,000
training utterances (73 hours of speech), 4,000 of which are the result of spontaneous
dictation by journalists with varying degrees of experience in dictation. The corpus
contains approximately 8,200 "conventional" development test utterances (eight hours
of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus,
the entire corpus was collected using two microphones, so the amount of speech in
the entire corpus is about 162 hours. In early 1993, a "Hub and Spoke" test paradigm
was designed, calling for eleven test sets, each a specific variation of the basic
or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets
each contain approximately 7,500 waveforms (eleven hours of speech). WSJ1 waveforms
have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression
algorithm developed at Cambridge University.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S13A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630314
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S13B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 418-053-774-232-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-II (WSJ1) Sennheiser
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S13B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94S13A - Complete CSR-II corpus
LDC94S13B - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech
*Data* The complete WSJ1 corpus contains approximately 78,000
training utterances (73 hours of speech), 4,000 of which are the result of spontaneous
dictation by journalists with varying degrees of experience in dictation. The corpus
contains approximately 8,200 conventional development test utterances (eight hours
of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus,
the entire corpus was collected using two microphones, so the amount of speech in
the entire corpus is about 162 hours. In early 1993, a Hub and Spoke test paradigm
was designed, calling for eleven test sets, each a specific variation of the basic
or hub condition. The eleven Hub and Spoke Development and Evaluation Test sets each
contain approximately 7,500 waveforms (eleven hours of speech). WSJ1 waveforms have
been compressed by about 2:1 using the SPHERE-embedded Shorten compression algorithm
developed at Cambridge University.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S13B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630322
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S13C
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 595-241-014-505-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSR-II (WSJ1) Other
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S13C
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94S13A - Complete CSR-II corpus
LDC94S13B - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech
*Data* The complete WSJ1 corpus contains approximately 78,000
training utterances (73 hours of speech), 4,000 of which are the result of spontaneous
dictation by journalists with varying degrees of experience in dictation. The corpus
contains approximately 8,200 "conventional" development test utterances (eight hours
of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus,
the entire corpus was collected using two microphones, so the amount of speech in
the entire corpus is about 162 hours. In early 1993, a "Hub and Spoke" test paradigm
was designed, calling for eleven test sets, each a specific variation of the basic
or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets
each contain approximately 7,500 waveforms (eleven hours of speech). WSJ1 waveforms
have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression
algorithm developed at Cambridge University.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S13C
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630241
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S14A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 367-677-522-995-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Air Traffic Control Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S14A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Air Traffic Control Corpus (ATC0) is comprised of recorded speech for use in supporting
research and development activities in the area of robust speech recognition in domains
similar to air traffic control (several speakers, noisy channels, relatively small
vocabulary, constrained language, etc.). The audio data is composed of voice communication
traffic between various controllers and pilots. *Data* The audio files are 8 kHz,
16-bit linear sampled data, representing continuous monitoring, without squelch or
silence elimination, of a single FAA frequency for one to two hours. There are also
files which indicate the amplitude of the received AM carrier signal at 10 msec. intervals.
Full transcripts, including the start and end times of each transmission, are provided
for each audio file. Each flight is identified by its flight number. ATC0 consists
of three subcorpora, one for each airport in which the transmissions were collected
-- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA).
The complete set contains approximately 70 hours of controller and pilot transmissions
collected via antennas and radio receivers which were located in the vicinity of the
respective airports. Detailed information regarding the collection process and the
equipment used can be found in the "atc.doc" files in the "doc" directories. The
ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced
by the National Institute of Standards and Technology for distribution by the Linguistic
Data Consortium.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S14A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S14B
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 303-675-958-561-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Air Traffic Control BOS
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S14B
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for
use in supporting research and development activities in the area of robust speech
recognition in domains similar to air traffic control (several speakers, noisy channels,
relatively small vocabulary, constrained language, etc.). The audio data on these
discs is composed of voice communication traffic between various controllers and pilots.
*Data* The audio files are 8 kHz, 16-bit linear sampled data, representing continuous
monitoring, without squelch or silence elimination, of a single FAA frequency for
one to two hours. There are also files which indicate the amplitude of the received
AM carrier signal at 10 msec. intervals. Full transcripts, including the start and
end times of each transmission, are provided for each audio file. Each flight is identified
by its flight number. ATC0 consists of three subcorpora, one for each airport in which
the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS)
and Washington National (DCA). The complete set contains approximately 70 hours of
controller and pilot transmissions collected via antennas and radio receivers which
were located in the vicinity of the respective airports. Detailed information regarding
the collection process and the equipment used can be found on each disc in the file
"atc.doc" in the "doc" directory. The ATC0 Corpus was collected by Texas Instruments
under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards
and Technology for distribution by the Linguistic Data Consortium.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S14B
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630268
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S14C
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 847-484-488-292-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Air Traffic Control DCA
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S14C
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for
use in supporting research and development activities in the area of robust speech
recognition in domains similar to air traffic control (several speakers, noisy channels,
relatively small vocabulary, constrained language, etc.). The audio data on these
discs is composed of voice communication traffic between various controllers and pilots.
*Data* The audio files are 8 kHz, 16-bit linear sampled data, representing continuous
monitoring, without squelch or silence elimination, of a single FAA frequency for
one to two hours. There are also files which indicate the amplitude of the received
AM carrier signal at 10 msec. intervals. Full transcripts, including the start and
end times of each transmission, are provided for each audio file. Each flight is identified
by its flight number. ATC0 consists of three subcorpora, one for each airport in which
the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS)
and Washington National (DCA). The complete set contains approximately 70 hours of
controller and pilot transmissions collected via antennas and radio receivers which
were located in the vicinity of the respective airports. Detailed information regarding
the collection process and the equipment used can be found on each disc in the file
"atc.doc" in the "doc" directory. The ATC0 Corpus was collected by Texas Instruments
under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards
and Technology for distribution by the Linguistic Data Consortium.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S14C
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630373
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 111-294-703-376-9
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This is a two-CD subset of the SWITCHBOARD collection (see above), selected for speaker
ID research and with special attention to telephone instrument variation. It contains
training and testing data for experiments in closed or open set recognition or verification.
Combining the two sides of the conversations also permits speaker change detection,
or speaker monitoring, experiments. There are 45 "target" speakers; four conversations
from each target are included, of which two are from the same handset. There are also
100 calls in which no target appears. Since all conversations are two-sided, this
results in 180 target sides and 180 + 200 = 380 nontarget sides. Except for truncations
of a few longer calls at five minutes, the calls themselves are as described under
SWITCHBOARD.
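The side counts quoted in this summary can be reproduced with a short Python sketch. It assumes that each of the target conversations contains exactly one target speaker, which is what the 180 + 200 arithmetic implies; the variable names are illustrative only.

```python
TARGET_SPEAKERS = 45
CONVERSATIONS_PER_TARGET = 4
NONTARGET_ONLY_CALLS = 100
SIDES_PER_CALL = 2  # all conversations are two-sided

# Assumption: one target speaker per target conversation.
target_calls = TARGET_SPEAKERS * CONVERSATIONS_PER_TARGET
target_sides = target_calls    # the target's side of each call
partner_sides = target_calls   # the other (nontarget) side of each call

nontarget_only_sides = NONTARGET_ONLY_CALLS * SIDES_PER_CALL
nontarget_sides = partner_sides + nontarget_only_sides

print(target_sides, nontarget_sides)
```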
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, Jack
ADDED ENTRY--PERSONAL NAME
- Personal name:
Holliman, Ed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 125-762-148-524-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
YOHO Speaker Verification
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The YOHO database contains a large-scale, high-quality speech corpus to support text-dependent
speaker authentication research, such as is used in secure access technology. The
data was collected in 1989 by ITT under a US Government contract, but has not been
available for public use before. Note that certain changes have been made to the corpus,
mainly to ensure the privacy of the speakers, and some data has been withheld by the
government for future use in testing. YOHO contains:
* Combination lock phrases (e.g. 36-24-36)
* Collected over a three-month period in a real-world office environment
* Four enrollment sessions per subject with 24 phrases per session
* Ten test sessions per subject with four phrases per session
* 8 kHz sampling with 3.8 kHz analog bandwidth
* 1.5 gigabytes of data
The number of trials is thus sufficient to permit evaluation
testing at high confidence levels. In each session, a speaker was prompted with a
series of phrases to be read aloud; each phrase was a sequence of three two-digit numbers
(e.g. 35 - 72 - 41, pronounced thirty-five seventy-two forty-one). The first four
sessions for a given speaker were enrollment sessions of 24 phrases and all additional
sessions were verification trials of four phrases each. In all there are 552 enrollment
sessions and 1,380 trial sessions, with a nominal time interval of three days between
sessions.
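The session totals in this summary are consistent with the per-subject figures, as the following Python sketch checks. The speaker count is not stated in the record; it is inferred here by dividing the totals by the nominal sessions per subject, and the phrase totals are likewise derived rather than quoted.

```python
ENROLL_SESSIONS_PER_SPEAKER = 4
ENROLL_PHRASES_PER_SESSION = 24
TEST_SESSIONS_PER_SPEAKER = 10   # nominal figure from the summary
TEST_PHRASES_PER_SESSION = 4

TOTAL_ENROLL_SESSIONS = 552
TOTAL_TEST_SESSIONS = 1_380

# Inferred, not stated in the record: number of speakers implied by each total.
speakers_from_enroll = TOTAL_ENROLL_SESSIONS // ENROLL_SESSIONS_PER_SPEAKER
speakers_from_test = TOTAL_TEST_SESSIONS // TEST_SESSIONS_PER_SPEAKER

# Derived phrase totals under the nominal per-session counts.
enroll_phrases = TOTAL_ENROLL_SESSIONS * ENROLL_PHRASES_PER_SESSION
test_phrases = TOTAL_TEST_SESSIONS * TEST_PHRASES_PER_SESSION

print(speakers_from_enroll, speakers_from_test, enroll_phrases, test_phrases)
```

Both totals imply the same number of speakers, which suggests the enrollment and verification figures describe the same subject pool.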
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Campbell, Joseph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Higgins, Alan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u vie d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630357
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 650-021-622-719-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OGI Multilanguage Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The corpus consists of responses to prompts spoken over commercial telephone lines
by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean,
Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1,927 calls,
an average of 175 calls per language. Speech was collected using an automated system
that answered the telephone, played digitized prompts in the appropriate language
to request the speech samples and digitized the callers' responses for a designated
period of time. Log files are included that provide a set of automatic measurements
made on each utterance. In addition, some utterances were automatically segmented
into broad phonetic categories. The speech data are compressed, with NIST SPHERE headers.
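The quoted average can be checked against the language list in this summary with a brief Python sketch. The per-language call counts are not given in the record, so only the overall mean is computed.

```python
LANGUAGES = [
    "English", "Farsi (Persian)", "French", "German", "Hindi", "Japanese",
    "Korean", "Mandarin Chinese", "Spanish", "Tamil", "Vietnamese",
]
TOTAL_CALLS = 1_927

# Mean number of calls per language, rounded to the nearest whole call.
average_calls = TOTAL_CALLS / len(LANGUAGES)
print(len(LANGUAGES), round(average_calls))
```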
LANGUAGE NOTE
- Language note:
Content in Vietnamese, Tamil, Korean, Japanese, Hindi, French, English, German, Spanish,
Mandarin Chinese, Persian, Dari, and Iranian Persian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630365
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 718-988-956-252-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OGI Spelled and Spoken Word
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The OGI Spelled and Spoken Telephone Corpus consists of speech recordings from over
3,650 telephone calls, each made by a different speaker to an automated prompting/recording
system installed at the Oregon Graduate Institute. Speakers were asked to say their
name, where they were calling from and where they grew up; they were asked to answer
a couple of yes/no questions and to spell their first and last names; many were also
asked to repeat a few specific words and to recite the letters of the alphabet. Each
response to a prompt is stored as a separate waveform file and the files are organized
according to prompt (response type); all responses from a given call have a unique
caller-index number as part of the file name, so that responses can easily be sorted
by speaker. Waveform data are stored in compressed form, using the NIST SPHERE 2.0
software package, which is available separately at no charge to users. SPHERE 2.0
provides the decompression software needed to extract the waveform data, as well as
tools for accessing and modifying file headers. Time-aligned phonetic transcriptions
are provided for a subset of responses and a complete log of each (giving speaker
sex, quality judgments and orthographic transcriptions of all responses) is included
in a form suitable for use as a relational database.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630284
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 396-239-314-326-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ATIS3 Training Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios completed by 137 subjects,
yielding a total of over 7,300 utterances. All utterances are transcribed and 2,900
of them have been categorized and annotated with canonical reference answers. The
relational database for this dataset included flight information for 46 cities and
52 airports. Data was collected at BBN, CMU, MIT and SRI, using their own ATIS systems
and at NIST using systems provided by BBN and SRI. Two 1,000-utterance test sets were
set aside from the data pooled by the collection sites. The first set was used in
a December 1993 ARPA test and is included in ATIS3. The second has been reserved for
future testing.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahl, Deborah A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bates, Madeleine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brown, Michael
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hunicke-Smith, Kate
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pao, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rudnicky, Alexander
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Danielson, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bocchieri, Enrico
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buntschuh, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schwartz, Beverly
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peters, Sandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ingria, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weide, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Yuzong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thayer, Eric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hirschman, Lynette
ADDED ENTRY--PERSONAL NAME
- Personal name:
Polifroni, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lund, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kawai, Goh
ADDED ENTRY--PERSONAL NAME
- Personal name:
Norton, Lew
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630292
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 361-674-631-516-1
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The recordings in this release were originally made in 1978-79 as part of a British
Home Office study into speaker identification techniques. Subsequently, it was realized
that a large body of unconstrained conversational material might be of interest to
researchers working in other speech processing fields. The recordings were transcribed
and the release prepared during 1993. The recordings were made at the Police Staff
College, Bramshill, Hampshire, England. The participants were police officers taking
part in the various courses at the college. This provided a wide range of regional
accents and a range of ages from late teens to early fifties. Each speaker is described
by nine demographic attributes. Three adjacent bedrooms were used. The two participants,
each alone in their rooms, conversed by telephone. The third room was used as a monitoring
and recording station. In addition to the telephone recordings, reference recordings
were made using a high quality dynamic microphone in each room. It is these higher
quality recordings, not the telephone speech, which are provided in the BRAMSHILL
set. The recordings were made on a Sony Elcaset EL-7 cassette machine, chosen at the
time because of its good speed stability. The microphone was a Shure SM-7 cardioid
type. The speech data was sampled at 10 kHz, 16-bit resolution. Some attempt was made
to control the acoustic environment. It is evident from listening to the recordings
that, while these measures produced a reasonable recording environment, the rooms
were far from soundproof. A variety of external noises (engines, aircraft, etc.) can
be heard on some of the recordings. Each speaker was given a pile of photographs.
In response to a bleep signal, each speaker introduced himself by name and read a
set of test sentences. After this, the main part of the conversation took place, in
which participants were asked to determine which of each pair of photographs had been
taken first (if indeed they were related at all). The conversations continued for
10 minutes until terminated by another bleep signal. During the digitization process,
some periods of silence were removed, so some recordings now appear to be shorter
than the original ten minutes. Furthermore, this means that recordings of two sides
of a conversation are no longer time-aligned. In addition, to preserve the anonymity
of the speakers, some passages (mainly the introductions) have been erased by replacing
them with binary zeroes. Finally, the bleep signals have also been erased with binary zeroes.
The transcriptions indicate where this has occurred. The speech was transcribed verbatim.
No attempt was made to correct grammar, fill in missing words, etc. Transcription conventions
are detailed in the documentation. Every lexical word from the transcriptions is contained
in the dictionary supplied in the INDEX directory. There are about 6,500 word types
in the 600k words of the transcripts. Contractions, part-words, slang words, hesitation
sounds and non-speech sounds are all treated as words in their own right
in the dictionary.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
British Home Office
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630349
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 593-364-872-062-5
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MACROPHONE consists of approximately 200,000 utterances by 5,000 speakers. It is designed
to provide material sufficient and suitable for research, development and evaluation
of automatic speech recognition technology for common telephone applications, such
as shopping, transportation, database access and autodialing. In addition to application-oriented
phrases and numerous digit strings, seven sentences are spoken by each talker to provide
ensemble phoneme, diphone and triphone coverage of the language. The spoken material
also refers to times, locations, monetary amounts, spellings and interactive operations.
*Data* The utterances were collected automatically over the telephone network by recording
directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly
equal numbers of males and females, were solicited by a marketing firm from all regions
of the United States. They ranged in age from the teens to the seventies and represented
a broad range of educations and incomes as well. Each recorded utterance is accompanied
by an orthographic transcription which also notes any unusual acoustic events or anomalies.
Macrophone is the American English contribution to an international database of telephone
speech corpora called POLYPHONE. Similar data sets are expected for major languages
of the world and at least some of these will be made available through LDC. Prospects
are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard
German, Japanese, Mandarin Chinese, Swiss French and Danish versions of POLYPHONE,
all with basically the same structure and methods of collection. MACROPHONE was collected
at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: "Macrophone:
An American English Telephone Speech Corpus for the POLYPHONE Project," by Jared Bernstein,
Kelsey Taussig and Jack Godfrey.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taussig, Kelsey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, Jack
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630381
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94T4A
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 804-587-727-227-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
UN Parallel Text (Complete)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94T4A
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94T4A - Complete UN Parallel Text corpus; LDC94T4B-1 - English text only;
LDC94T4B-2 - French text only; LDC94T4B-3 - Spanish text only. This set of three compact discs
contains documents provided to the LDC by the United Nations, for use in research
on machine translation technology. The documents come from the Office of Conference
Services at the UN in New York and are drawn from archives that span the period between
1988 and 1993. This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set. Care has been taken
to arrange the document files in a parallel directory structure for each language,
so that corresponding translations of a document are found directly by means of the
directory paths and file names. All parallel files in this corpus are English-based:
for every file on the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to assist in determining
which parallels are present. The total content by language is summarized below (values
are approximate):
Language                        No. of documents   Millions of words
---------------------------------------------------------------------
English                         22,000             59
French                          20,000             58
Spanish                         14,400             48
French/Spanish parallel data    12,700             38 (per language)
---------------------------------------------------------------------
In preparing the text for publication,
we have applied SGML tagging (Standard Generalized Markup Language) that preserves
all typographic and meta-information that was present in the UN archival files. For
those researchers who use SGML, a working DTD (Document Type Definition) is provided
on each disc. For those who do not need SGML markup, a simple script is included,
for use with the sed (stream-editor) utility, that will filter out the SGML-specific
material and meta-information, leaving only the plain text. (Sed is a standard utility
on Unix systems, and is also available as free software for MS-based systems.) The
character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some
other non-ASCII characters occupy the upper 128 entries of the character table. Parallel
samples of the three languages in this publication are listed below: LDC1994T04
English Sample, LDC1994T04 French Sample, LDC1994T04 Spanish Sample. Based on the
combined usage of title strings and document numbers, it was possible to identify
parallel sets amounting to over 60% of the data in the archive (a total of 56,684
files in 21,986 parallel sets). We have yet to find a reasonable method for doing
a more careful search for parallels in the remaining 40%. Part of this residue is
due to the fact that this corpus contains only English-based parallel sets; parallel
sets that included only French and Spanish versions have not been included in this
release. Users of this corpus must be warned that the parallel sets identified by
this automatic method will include errors. We have observed a number of cases (over
700 in the corpus as a whole) where the members of a parallel set show a serious discrepancy
in quantity of text. Also, we must expect that at least some of these sets (and perhaps
some less obvious cases) constitute a complete mismatch. The reftable files in the
tables directory give an indication of the relative consistency among members of a parallel
set in terms of overall size. From these tables, the least likely candidates for parallelism
can be easily identified.
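The path-based lookup and the size-consistency check described above can be sketched as follows. This is an illustration under stated assumptions, not the method used to build the reftable files: it assumes each language disc is mounted as a directory tree with identical relative paths and file names, and the function name `find_parallels` and the `max_ratio` threshold are hypothetical.

```python
from pathlib import Path


def find_parallels(english_root, other_roots, max_ratio=2.0):
    """Pair each English file with same-path files under the other roots.

    Returns (relative_path, language, english_bytes, other_bytes, suspicious)
    tuples; `suspicious` marks pairs whose sizes differ by more than
    `max_ratio`, or where either file is empty.
    """
    english_root = Path(english_root)
    rows = []
    for en_file in sorted(p for p in english_root.rglob("*") if p.is_file()):
        rel = en_file.relative_to(english_root)
        en_size = en_file.stat().st_size
        for lang, root in other_roots.items():
            candidate = Path(root) / rel  # assumed: same path on every disc
            if not candidate.is_file():
                continue  # no parallel in this language
            size = candidate.stat().st_size
            small, large = sorted((en_size, size))
            suspicious = small == 0 or large / small > max_ratio
            rows.append((rel.as_posix(), lang, en_size, size, suspicious))
    return rows
```

A call such as `find_parallels("english", {"french": "french", "spanish": "spanish"})` would list every English file that has a counterpart and flag the pairs least likely to be true parallels; byte counts are only a rough proxy for quantity of text.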
LANGUAGE NOTE
- Language note:
Content in French, English, and Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94T4A
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94T4B-1
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 494-248-767-772-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
UN Parallel Text (English)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94T4B-1
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94T4A - Complete UN Parallel Text corpus; LDC94T4B-1 - English text only;
LDC94T4B-2 - French text only; LDC94T4B-3 - Spanish text only. This set of three compact discs
contains documents provided to the LDC by the United Nations, for use in research
on machine translation technology. The documents come from the Office of Conference
Services at the UN in New York and are drawn from archives that span the period between
1988 and 1993. This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set. Care has been taken
to arrange the document files in a parallel directory structure for each language,
so that corresponding translations of a document are found directly by means of the
directory paths and file names. All parallel files in this corpus are English-based:
for every file on the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to assist in determining
which parallels are present. Due to the nature and organization of UN translation
services and the original electronic text archives, the process of finding and sorting
out parallel documents yielded numerous gaps, with many files in each language having
no parallel in other languages. In preparing the text for publication, we have applied
a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers
who use SGML, a working DTD (Document Type Definition) is provided on each disc. For
those who do not need SGML markup, a simple script is included that can be used to
filter out the SGML-specific material and leave only the plain text. The character
set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other
non-ASCII characters occupy the upper 128 entries of the character table.
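For readers without the supplied filter script, the two steps described above (decoding ISO 8859-1 and dropping SGML markup) can be approximated as follows. This is a rough stand-in, not the script shipped on the discs: the function name `strip_sgml` is hypothetical, and the regular expression only removes tags that open and close on the same line, leaving any SGML entities untouched.

```python
import re

# Matches a single-line SGML tag such as <p> or </doc>.
TAG = re.compile(r"<[^>]*>")


def strip_sgml(raw: bytes) -> str:
    """Decode ISO 8859-1 bytes and return the text with tags removed."""
    text = raw.decode("iso-8859-1")  # every byte maps to one character
    kept = []
    for line in text.splitlines():
        cleaned = TAG.sub("", line).strip()
        if cleaned:  # drop lines that held only markup
            kept.append(cleaned)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = b"<doc id=1>\n<p>Les Nations Unies ont d\xe9cid\xe9</p>\n</doc>"
    print(strip_sgml(sample))
```

Decoding as ISO 8859-1 before any text processing keeps accented letters intact, since they occupy the upper 128 entries of the character table.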
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94T4B-1
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630403
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94T4B-2
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 698-675-005-703-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
UN Parallel Text (French)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94T4B-2
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94T4A - Complete UN Parallel Text corpus; LDC94T4B-1 - English text only;
LDC94T4B-2 - French text only; LDC94T4B-3 - Spanish text only. This set of three compact discs
contains documents provided to the LDC by the United Nations, for use in research
on machine translation technology. The documents come from the Office of Conference
Services at the UN in New York and are drawn from archives that span the period between
1988 and 1993. This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set. Care has been taken
to arrange the document files in a parallel directory structure for each language,
so that corresponding translations of a document are found directly by means of the
directory paths and file names. All parallel files in this corpus are English-based:
for every file on the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to assist in determining
which parallels are present. Due to the nature and organization of UN translation
services and the original electronic text archives, the process of finding and sorting
out parallel documents yielded numerous gaps, with many files in each language having
no parallel in other languages. In preparing the text for publication, we have applied
a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers
who use SGML, a working DTD (Document Type Definition) is provided on each disc. For
those who do not need SGML markup, a simple script is included that can be used to
filter out the SGML-specific material and leave only the plain text. The character
set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other
non-ASCII characters occupy the upper 128 entries of the character table.
LANGUAGE NOTE
- Language note:
Content in French. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94T4B-2
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630411
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94T4B-3
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 590-973-820-417-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
UN Parallel Text (Spanish)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94T4B-3
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC94T4A - Complete UN Parallel Text corpus; LDC94T4B-1 - English text only;
LDC94T4B-2 - French text only; LDC94T4B-3 - Spanish text only. This set of three compact discs
contains documents provided to the LDC by the United Nations, for use in research
on machine translation technology. The documents come from the Office of Conference
Services at the UN in New York and are drawn from archives that span the period between
1988 and 1993. This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set. Care has been taken
to arrange the document files in a parallel directory structure for each language,
so that corresponding translations of a document are found directly by means of the
directory paths and file names. All parallel files in this corpus are English-based:
for every file on the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to assist in determining
which parallels are present. Due to the nature and organization of UN translation
services and the original electronic text archives, the process of finding and sorting
out parallel documents yielded numerous gaps, with many files in each language having
no parallel in other languages. In preparing the text for publication, we have applied
a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers
who use SGML, a working DTD (Document Type Definition) is provided on each disc. For
those who do not need SGML markup, a simple script is included that can be used to
filter out the SGML-specific material and leave only the plain text. The character
set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other
non-ASCII characters occupy the upper 128 entries of the character table.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94T4B-3
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630330
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94T5
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 511-168-567-582-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
slv
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
nor
- Language code of text/sound track or separate title:
nob
- Language code of text/sound track or separate title:
nno
- Language code of text/sound track or separate title:
lit
- Language code of text/sound track or separate title:
lat
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
gla
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
est
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
gre
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
dan
- Language code of text/sound track or separate title:
bul
- Language code of text/sound track or separate title:
alb
- Language code of text/sound track or separate title:
may
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
srp
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
dut
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
hrv
- Language code of text/sound track or separate title:
alb
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
slv
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
nor
- Language code of text/sound track or separate title:
nob
- Language code of text/sound track or separate title:
nno
- Language code of text/sound track or separate title:
lit
- Language code of text/sound track or separate title:
lat
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
gla
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
est
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ell
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
dan
- Language code of text/sound track or separate title:
bul
- Language code of text/sound track or separate title:
als
- Language code of text/sound track or separate title:
zsm
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
srp
- Language code of text/sound track or separate title:
uzn
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
nld
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
hrv
- Language code of text/sound track or separate title:
sqi
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ECI Multilingual Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94T5
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI),
has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly
92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML
(to varying levels of detail), with easy access to the source text without markup.
Twelve of the component corpora are multilingual parallel corpora with from two to
nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese)
are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7).
The CD-ROM is in High Sierra format (ISO 9660), readable on UNIX, MSDOS and Apple
systems at least. The amount of material per language varies, from about 36 million
words (German) to about 5 thousand words (Bulgarian). The majority of sources are
journalistic in nature (newspapers, magazines, broadcasts); additional sources include
dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical
reports and proceedings or publications of international organizations. The table
below lists the languages included, the subcorpus numbers for each language
(in parentheses) and the amount of data per language in thousands of lexical words.
Language: (Subcorpus #) Kwords; Total
German: (70) 34291; (09) 191; (65) 20; (28) 187; (29) 59; (30) 76; (47) 24; (59) 50; (71) 21; (70A) 999; Total 35918
French: (31) 4775; (04) 4121; (28) 187; (29) 59; (30) 76; (47) 24; (51) 6; (59) 50; (71) 21; (32) 1667; Total 10986
Spanish: (31) 4500; (13) 830; (14) 1041; (15) 447; (47) 24; (32) 1667; (59) 50; (71) 21; Total 8580
English: (31) 4222; (36) 1141; (74) 95; (28) 187; (47) 24; (51) 6; (56) 97; (59) 50; (71) 21; (32) 1667; Total 7510
Dutch: (03) 5500; (02) 600; (47) 24; (71) 21; Total 6145
Czech: (44) 4726; Total 4726
Italian: (11) 3518; (42) 303; (58) 13; (29) 59; (30) 76; (47) 24; (71) 21; Total 4014
Chinese: (78) 2895; Total 2895
Greek: (10) 2515; (47) 24; (59) 50; (71) 21; Total 2610
Norwegian: (41) 2226; Total 2226
Swedish: (37) 1718; Total 1718
Serb/Croat/Slov: (24) 700; (56) 289; Total 989
Tibetan: (76) 834; Total 834
Portuguese: (60) 675; (47) 24; (71) 21; Total 720
Malay: (80) 563; Total 563
Russian: (73) 364; Total 364
Japanese: (57) 203; Total 203
Turkish: (20) 173; (20A) 110; Total 283
Albanian: (82) 205; Total 205
Gaelic: (55) 141; Total 141
Estonian: (39) 100; Total 100
Usbek: (81) 88; Total 88
Latin: (74) 75; Total 75
Danish: (47) 24; (71) 21; Total 45
Lithuanian: (89) 20; Total 20
Bulgarian: (84) 5; Total 5
Total: 91969
LANGUAGE NOTE
- Language note:
Content in Turkish, Swedish, Slovenian, Russian, Portuguese, Norwegian, Norwegian
Bokmål, Norwegian Nynorsk, Lithuanian, Latin, Japanese, Scottish Gaelic, French, Estonian,
English, Modern Greek (1453-), German, Danish, Bulgarian, Tosk Albanian, Standard
Malay, Spanish, Serbian, Northern Uzbek, Mandarin Chinese, Italian, Dutch, Czech,
Croatian, and Albanian. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94T5
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630500
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 155-446-887-889-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
KING Speaker Verification
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The KING corpus was collected at ITT in 1987 under a US government research contract
and although other contractors have received it, it has not been officially available
for public use before now. The version now available from LDC, referred to as KING-92,
is based on a 1992 reprocessing of the original recordings (see below). It contains
recorded speech from 51 male speakers in two versions, which differ in channel characteristics:
one from a telephone handset and one from a high-quality microphone. The speakers
are further subdivided into two groups, 25 in one and 26 in the other, who were recorded
at different locations. For each speaker and channel there are ten files, corresponding
to sessions of about 30 to 60 seconds' duration each. The interval between sessions
varies from a week to a month. The transcripts contain about 54k word tokens (4.8k
types). KING is designed principally for closed set experiments in text-independent
speaker identification or verification over toll-quality telephone lines, although
the single-sided collection format does not permit simulation of real telephone traffic.
The ten sessions allow for a variety of divisions into training and test data, with
the possibility of multiple test sets. For example, one could examine the effects
of the amount of training on performance, or examine the variability of performance
over several test samples (sessions) given a fixed amount of training (but see below
about the "Great Divide"). *Data* The collection method used in KING was to establish
a call from a laboratory location at ITT (either San Diego, CA or Nutley, NJ) over
long distance lines and back to another phone at the same location. The phones used
by the test subjects were equipped with an additional microphone, so two parallel
recordings were made of that side of the conversation, while the interlocutor's side
was not recorded. The two parties either spoke spontaneously or carried out a variety
of tasks designed to elicit natural-sounding speech: interpreting a drawing, solving
a problem, describing a picture, etc. There were 25 speakers in Nutley and 26 in San
Diego. Speech-to-noise ratios average about 10 dB worse for the Nutley telephone data
than for San Diego; in fact it is less than 20 dB for over half the Nutley files.
Users of this corpus therefore usually run separate experiments, or at least report
results separately, according to site. A more subtle difference in the recordings,
however, sometimes referred to as the "Great Divide," cuts across the telephone data
for the San Diego speakers. This was apparently due to a minor equipment change which
was made during the collection; it results in a slight but consistent change in the
average long term spectrum of the telephone data recorded after the fifth session.
Training and testing on data from the same side of this divide gives significantly
better results than across it. Since the discovery of this difference, investigators
now generally report results on the first and last five sessions of the San Diego
telephone KING data separately, or they report within vs. across this boundary. A
detailed description of the spectral differences can be found in a report by Thomas
Crystal and Ned Neuburg which accompanies the CD-ROM version. Since there are a number
of published papers with results based on the original KING corpus and two versions
of the data in existence, note that the new CD-ROM version, called KING-92, is based
on a 1992 re-issue of the data from ITT. It differs from the original corpus in a
few details: * The original data was sampled at 10 kHz, but has now been resampled
at 8 kHz; * Missing segments, most on the order of seconds, have been restored to
the data and the alignment between the high quality microphone and the telephone handset
data files has been corrected; * Originally both an orthographic and a phonetic transcription
of the data, with time alignments, were part of the corpus, but there were numerous
errors; only an unaligned orthographic transcription has been retained. * Documentation
has been changed to reflect these differences and a description of the artifactual
division between sessions 1-5 and 6-10 in the San Diego telephone data is included.
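The within vs. across reporting convention described above can be sketched as follows. This is only an illustrative sketch: the session numbering 1-10 and the boundary after session 5 come from the summary, while the function names and the pairing of one training session with one test session are assumptions made for the example, not part of the KING documentation.

```python
from itertools import permutations

SESSIONS = range(1, 11)   # ten sessions per speaker, per the summary
DIVIDE_AFTER = 5          # spectral change reported after the fifth session


def side(session):
    """Return 0 for sessions 1-5 and 1 for sessions 6-10."""
    return 0 if session <= DIVIDE_AFTER else 1


def split_pairs():
    """Group ordered (train, test) session pairs by whether they cross the divide."""
    within, across = [], []
    for train, test in permutations(SESSIONS, 2):
        target = within if side(train) == side(test) else across
        target.append((train, test))
    return within, across


within, across = split_pairs()
print(len(within), "within-divide pairs;", len(across), "across-divide pairs")
```

The grouping applies only to the San Diego telephone data; results for the two groups would then be reported separately.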
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Higgins, Alan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vermilyea, Dave
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630454
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 388-101-290-949-9
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection
is a three CD-ROM set that contains complete development test and evaluation test
suites for speaker-independent, large-vocabulary speech recognition systems. The development
and evaluation tests share a common structure, consisting of two core test components
("hubs") and seven specialized test components ("spokes"). The hub tests, which were
mandatory for all ARPA CSR participants in the November '94 evaluations, provide a
base-line for ASR performance, while the spokes provide the means for assessing the
impact of particular speaking conditions or processing strategies in relation to base-line
performance. Participants were free to take any combination of spoke tests according
to their research interests. Taken together, the collection encompasses 180 speakers,
each producing 20-40 sentences. These are organized into two complete development
test sets and one evaluation set. The collection also includes complete documentation
on the test specifications, data collection procedures, transcriptions and scoring
protocols, together with the latest available version of NIST software for scoring
ASR results and managing SPHERE waveform files. All speech data is accompanied by
both the prompting texts and the detailed orthographic transcriptions of the utterances.
This was the first ARPA CSR Benchmark Test in which prompting texts were drawn from
a variety of news sources. Whereas earlier benchmarks were based on Wall Street Journal
excerpts (from the period 1987-89), CSR-III prompts come from a variety of North American
Business News Services: Reuters News Service, New York Times, Washington Post and Los
Angeles Times as well as WSJ; all texts are drawn from financial news articles written
during the period of April through June, 1994. (NAB stands for "North American Business,"
in contrast to earlier benchmarks and training collections labeled "WSJ"). An important
companion to the 1994 Benchmark Speech data collection is the four-disk CSR-III Text
Collection (LDC95T6), which includes the ARPA CSR 1994 Standard Language Model. This
corpus is also available from the LDC as a 1995 release. Because of restrictions imposed
by the copyright holders of much of the NAB text, both the speech and text collections
are available to LDC members only. For more information on how to join, send email
to ldc@ldc.upenn.edu.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630586
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 500-945-172-283-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
WSJCAM0 Cambridge Read News
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition
(The Cambridge University Version of the ARPA CSR Corpus WSJ0). This release of WSJCAM0
represents version 1.1 of the corpus, which was initially released on tape by Cambridge
University as of August 31, 1994. This collection was modelled directly on the ARPA
CSR Corpus released by LDC in 1993: it used the same dual-microphone recording paradigm
and a subset of prompting texts drawn from the Wall Street Journal. There are two
key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 were native
speakers of British English and (2) in addition to standard orthographic transcripts,
WSJCAM0 also has information on the time alignment between the sampled waveform and
both the words and the phonetic segments. The contents of the publication consist
of the following: * Training data from head-mounted microphone * Development test
data from head-mounted microphone, plus first set of evaluation test data * Training
data from desk-mounted microphone * Development test data from desk-mounted microphone,
plus second set of evaluation test data There are 90 utterances from each of 92 speakers
that are designated as training material for speech recognition algorithms. An additional
48 speakers each read 40 sentences containing only words from a fixed 5,000 word vocabulary
and another 40 sentences using a 64,000 word vocabulary, to be used as testing material.
Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences.
Recordings were made from two microphones: a far-field desk microphone and a head-mounted
close-talking microphone. Within the train and test sets, speech data are organized
by speaker; prompting texts, detailed transcriptions and speaker information are
included in each speaker directory. All waveform files have NIST SPHERE headers. Waveform
data are compressed using the Shorten algorithm developed by Tony Robinson at Cambridge
University, as adapted for use in the NIST SPHERE software package.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Robinson, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fransen, Jeroen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pye, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Foote, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Renals, Steve
ADDED ENTRY--PERSONAL NAME
- Personal name:
Woodland, Phil
ADDED ENTRY--PERSONAL NAME
- Personal name:
Young, Steve
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630578
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 070-132-331-927-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRAINS Spoken Dialog Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains a corpus of task-oriented spoken dialogs. These dialogs were
collected in 1993 at the University of Rochester Department of Computer Science as
part of the TRAINS project, a project to develop a conversationally proficient planning
assistant, which helps a user construct a plan to achieve some task involving the
manufacturing and shipment of goods in a railroad freight system. The collection procedure
was designed to make the setting as close to human-computer interaction as possible,
but was not a "wizard" scenario, where one person pretends to be a computer. Thus
these dialogs provide a snapshot into an ideal human-computer interface that would
be able to engage in fluent conversations. Altogether, this corpus includes 98 dialogs,
collected using 20 different tasks and 34 different speakers. This amounts to six
and a half hours of speech, about 5,900 speaker turns and 55,000 transcribed words.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Allen, James
ADDED ENTRY--PERSONAL NAME
- Personal name:
Heeman, Peter A.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630438
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 847-846-823-557-6
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains a corpus of speech and natural language data collected under
the auspices of the Advanced Research Projects Agency Spoken Language Systems (ARPA-SLS)
technology development program. The corpus, which contains data in the Air Travel
Information Services (ATIS) domain, was designed by the ARPA-SLS Multi-site ATIS Data
COllection Working (MADCOW) group and was collected by five sites at locations across
the U.S.: * BBN Systems & Technologies, Cambridge, MA * Carnegie Mellon University,
Pittsburgh, PA * MIT Laboratory for Computer Science, Boston, MA * National Institute
of Standards and Technology, Gaithersburg, MD * SRI International, Menlo Park, CA
The corpus is part of the third phase of collection of ATIS data (ATIS3) and comprises
the development test (NIST Speech Disc 17-4.2) and evaluation test material (NIST
Speech Disc 17-5.1) used in the December 1994 ARPA SLS Benchmark Tests. As in the
previous ATIS corpora, the speech contained in this corpus was elicited by presenting
subjects with various hypothetical travel planning scenarios to solve. The resulting
spontaneous spoken queries were recorded as the subjects interacted with partially
or completely automated ATIS systems to solve the scenarios. Note that the ATIS3 training
data is available on NIST Speech Discs 17-1.1 - 17-3.1. *Data* The recorded speech
has been transcribed and annotated with categorizations and canonical reference answers.
All of the utterances have been recorded using a close-talking, noise-canceling head-mounted
Sennheiser microphone. For some subjects, secondary (noisier) microphone data was
recorded simultaneously as well. This release also contains the ATIS3 46 city/52 airport
relational database, a revised Principles of Interpretation and test implementation
and scoring instructions as well as other general documentation. The ATIS3 corpus
has been verified, collated and documented by the National Institute of Standards and
Technology (NIST) in cooperation with MADCOW and is distributed by the Linguistic Data
Consortium (LDC).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahl, Deborah A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bates, Madeleine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brown, Michael
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hunicke-Smith, Kate
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pao, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rudnicky, Alexander
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Danielson, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bocchieri, Enrico
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buntschuh, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schwartz, Beverly
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peters, Sandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ingria, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weide, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Yuzong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thayer, Eric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hirschman, Lynette
ADDED ENTRY--PERSONAL NAME
- Personal name:
Polifroni, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lund, Bruce
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kawai, Goh
ADDED ENTRY--PERSONAL NAME
- Personal name:
Norton, Lew
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630551
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 574-104-816-534-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
PhoneBook: NYNEX Isolated Words
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
PhoneBook is a phonetically-rich, isolated-word, telephone-speech database, created
because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated
continued importance of isolated-word and keyword-spotting technology to speech-recognition-based
applications over the telephone and (3) findings that continuous-speech training data
is inferior to isolated-word training for isolated-word recognition. The goal of PhoneBook
is to serve as a large database of American English word utterances incorporating
all phonemes in as many segmental/stress contexts as are likely to produce coarticulatory
variations, while also spanning a variety of talkers and telephone transmission characteristics.
We anticipate that it will be useful in ways analogous to TIMIT/NTIMIT. The core section
of PhoneBook consists of a total of 93,667 isolated-word utterances, totalling 23
hours of speech. This breaks down to 7,979 distinct words, each said by an average
of 11.7 talkers, with 1,358 talkers each saying up to 75 words. All data were collected
in 8-bit mu-law digital form directly from a T1 telephone line. Talkers were adult
native speakers of American English chosen to be demographically representative of
the U.S. Given the large set of talkers being recruited for the PhoneBook database, it
made sense to exploit the opportunity to collect additional utterances. We have chosen
spontaneous numerical utterances, because of widespread interest in them and the need
for very large numbers of talkers for research into spontaneous-speech effects. We
restricted these to just three spontaneous digit sequences and one money amount, as the
lists for the core of PhoneBook have been designed to approach the limit of reasonable
duration for a caller's session. As a result, PhoneBook contains a total of 5,105
spontaneous utterances.
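As a quick consistency check on the figures quoted in this summary, the averages can be recomputed from the totals. This is a minimal sketch: the four input numbers are taken from the summary, and the variable names are chosen only for illustration.

```python
# Totals quoted in the summary for the core section of PhoneBook
utterances = 93_667      # isolated-word utterances
distinct_words = 7_979   # distinct words
talkers = 1_358          # talkers
hours = 23               # total speech duration

talkers_per_word = utterances / distinct_words    # average talkers saying each word
words_per_talker = utterances / talkers           # average words said by each talker
seconds_per_utterance = hours * 3600 / utterances # average utterance duration

print(f"talkers per word:      {talkers_per_word:.1f}")
print(f"words per talker:      {words_per_talker:.1f}")
print(f"seconds per utterance: {seconds_per_utterance:.2f}")
```

The first ratio agrees with the stated average of 11.7 talkers per word, and the second stays below the stated maximum of 75 words per talker.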
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pitrelli, John F.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fong, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630519
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95S28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 355-182-932-307-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LATINO-40 Spanish Read News
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95S28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database provides a set of recordings for training speaker-independent systems
that recognize Latin-American Spanish. It was recorded by the Entropic Research Laboratory
in the period from July 11 through September 9, 1994 in Palo Alto, California. The
database comprises about 5,000 utterance files. These files include about 125 utterances
from each of 40 different speakers, 20 male and 20 female. The recordings were all
made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment,
and the utterances were digitized in 16-bit samples at 16 kHz. The Linguistic Data
Consortium provided 13,000 sentences that had been selected from Latin American newspaper
text by people working at Texas Instruments. The sentences are all shorter than 80
characters and are not grouped into larger constituents such as paragraphs or stories.
The speech files have NIST SPHERE headers and are presented in compressed format,
using the shorten speech compression algorithm developed by Tony Robinson at Cambridge
University, as implemented in the NIST SPHERE software package. This software is included
with the data.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grundy, Bill
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rosenfeld, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Najmi, Amir
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mankoski, Psi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95S28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u por d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630470
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 082-576-700-069-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
European Language Newspaper Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The European Language Newspaper Text corpus is also known as the French Language News
Corpus. This corpus includes roughly 100 million words of French, 90 million words
of German and 15 million words of Portuguese and has been marked using SGML. The text
is taken from the following sources: * Approximately 60 million words of text in French
and German have been made available from the Associated Press (AP) World Stream. AP
World Stream is a compilation of AP news reports produced in 86 bureaus in 68 countries.
The Associated Press Worldstream newswire service provides articles in six languages,
interleaved on a single data stream. The data is collected via an Associated Press
installed telephone line at the LDC. * Approximately 110 million words of text in
French, German and Portuguese have been made available from Agence France Presse.
Each language was supplied in separate data streams collected via a Dateno MKII satellite
receiver and associated equipment at the LDC. * Approximately 20 million words of
text in German have been made available from Deutsche Presse Agentur. The text is
collected via an AP Datafeatures telephone line installed at the Linguistic Data Consortium.
* A smaller part of the corpus comes from Le Monde newspaper. The Le Monde data covers
about 5.6 million words of French. It is quite distinct from the AP and AFP materials
in its markup approach, because it has been prepared in compliance with the conventions
of the Text Encoding Initiative (TEI), rather than having been based on the model
of the TIPSTER collections, which were originally developed prior to the establishment
of the TEI conventions.
LANGUAGE NOTE
- Language note:
Content in Portuguese, French, and German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630527
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 133-578-348-091-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Mandarin Chinese News Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Linguistic Data Consortium (LDC) announces the availability of a Mandarin Chinese
text corpus. This corpus includes about 250 million GB-encoded text characters. The
Mandarin News Corpus includes text from various journalistic sources:
* newspaper text from Renmin Ribao (People's Daily)
* radio scripts from China Radio International
* newswire text from Xinhua newswire service
The format of this corpus uses a labeled bracketing, expressed in the style of SGML
(Standard Generalized Markup Language).
The header fields provided by the sources, which give information such as topic, date
and article ID, have been retained. The articles cover a variety of topics, including
international and domestic news, sports and culture.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wu, Zhibiao
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630489
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 711-183-299-010-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hansard French/English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn
from official records of the proceedings of the Canadian Parliament. While the content
is therefore limited to legislative discourse, it spans a broad assortment of topics
and the stylistic range includes spontaneous discussion and written correspondence
along with legislative propositions and prepared speeches. The collection presented
here has been assembled by the LDC by way of archives from two distinct secondary
sources. Material from one time period of parliamentary proceedings was acquired through
the IBM T. J. Watson Research Center, while material from another period was acquired
through Bell Communications Research Inc. (Bellcore). The combined collection covers
a time span from the mid-1970's through 1988, with no apparent duplication between
the two data sources. Aside from covering different time periods, the two archives
have different organization and have undergone different amounts and kinds of processing
in being prepared as a parallel language resource. In addition, the Bellcore set itself
comprises two distinct types of data -- one appears to be the main parliamentary proceedings
(similar in nature to the IBM set), while the other consists of transcripts from committee
hearings. The three sets have been kept distinct in this publication and each is described
in greater detail in separate documentation files. The three sets have the following
in common:
* They are rendered here using the 8-bit ISO-Latin1 character encoding standard.
* They use a minimal amount of SGML tagging to identify sentences or paragraphs.
* All sets are organized using a parallel file structure, in which the content of
a given English text file is matched by the content of a corresponding French text file.
* The SGML text files for the IBM and the Bellcore committee-hearings data are published
in compressed form, using the freely available GNU zip utility (gzip). The Bellcore
main-session files are not compressed.
The three sets differ as follows:
* The IBM collection is presented as a sequence of parallel sentences (there are nearly
2.87 million parallel sentence pairs in the set).
* The Bellcore data are presented as sequences of paragraphs.
* The Bellcore main-session data are accompanied by mapping files that provide computed
paragraph alignments and word-token correspondences; no additional alignment data are
provided for the Bellcore committee texts (and none are needed for the IBM sentences).
LANGUAGE NOTE
- Language note:
Content in French and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Roukos, Salim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melamed, Dan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630535
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 667-148-284-023-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
North American News Text Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The North American News Text corpus is composed of news text that has been formatted
using TIPSTER-style SGML markup. The text is taken from the following sources:
Source                                       Dates Covered   Approx. # Words (Millions)
---------------------------------------------------------------------------------------
Los Angeles Times & Washington Post          05/94-08/97      52
New York Times News Syndicate                07/94-12/96     173
Reuters News Service (General & Financial)   04/94-12/96      85
Wall Street Journal                          07/94-12/96      40
---------------------------------------------------------------------------------------
Both the New York Times and the L. A. Times/Washington Post services actually include a
range of other newspaper sources in their syndicated newswires. The L. A. Times/Washington
Post material will be found to include the following sources (in lesser amounts) in
addition to the two predominant sources:
* Newsday
* The Baltimore Sun
* The Hartford Courant
The New York Times material will be found to contain the following sources (in lesser
amounts), but N.Y. Times articles predominate:
* Bloomberg Business News
* The Boston Globe
* Los Angeles Daily News
* Fort Worth Star-Telegram
* Newsweek
* Cox News Service
* The Arizona Republic
* Seattle Post-Intelligencer
* San Francisco Examiner
* Houston Chronicle
* San Francisco Chronicle
* Economist Newspaper Ltd.
* Hearst Newspapers
Both of these newswire services also include small numbers of articles from a larger
set of miscellaneous sources. The ones listed above appear with some frequency on a
daily basis.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630462
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T6
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 080-723-177-118-4
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T6
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The third ARPA Continuous Speech Recognition (CSR) Language Model Training Data is
a set for speaker-independent, large-vocabulary speech recognition systems. This corpus
is an important companion to the 1994 Benchmark Speech data collection (LDC95S23).
The text collection comprises both source text data (prepared by LDC and BBN) and
derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies.
The sources include all available WSJ texts, spanning 1987 through March 1994 and
all AP and San Jose Mercury news data from the three TIPSTER volumes. (Some of the
WSJ data, from 1992 through 1994, appears here for research use for the first time).
This corpus is also available from the LDC as a 1995 release. Because of restrictions
imposed by the copyright holders of much of the North American Business (NAB) news text, both the speech and text
collections are available to LDC members only. For more information on how to join,
send email to ldc@ldc.upenn.edu.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rosenfeld, Roni
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paul, Doug
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T6
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630543
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T7
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 650-146-755-602-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T7
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Original release was: LDC Catalog No.: LDC94T4B-3.1; NIST Catalog No.: NA; LDC Release
date: 4/94 (MY94). Original Treebank Release: This release contains over 1.6 million
words of hand-parsed material from the Dow Jones News Service, plus an additional
one million words tagged for part-of-speech. This material is a subset of the language
model corpus for the DARPA CSR large-vocabulary speech recognition project. It also
contains the first fully parsed version of the Brown Corpus, which has also been completely
retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed
data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS. In
addition, the release includes source code for programs that were used by the PTB
project in creating portions of the data. Source code is also included for "tgrep,"
a program that permits the user to search for specific constituents in tree structures.
All software is provided "as is." (We have learned since publication that the tgrep
source code provided on the CD-ROM is not readily portable, and compiling tgrep requires
modification of the source files. Also included is a pre-compiled program file for
tgrep, built for use on Sun SPARC systems.) Release 2: The PTB Project Release 2
features the new PTB-2 bracketing style, which is designed to allow the extraction
of simple predicate/argument structure. Over one million words of text are provided
with this bracketing applied, along with a complete style manual explaining the bracketing
and new versions of tools for searching and treating bracketed data. This release
also contains all the annotated text material from the earlier Treebank Preliminary
Release, including the Brown Corpus. While these materials have not all been converted
to the newer bracketing style, they have been cleaned up to remove problems that had
appeared in the earlier release. The contents of Treebank Release 2 are as follows:
* One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
* A small sample of ATIS-3 material annotated in Treebank-2 style.
* 300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging
guidelines.
* The contents of the previous Treebank release (Version 0.5), with cleaner versions
of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
* Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation
package (note that usability of this release of tgrep is limited: users of Sun SPARC
systems should have no problem, but others may find the software to be difficult or
impossible to port).
In addition, the PTB Project has provided some updates, announcements
and a discussion forum for users. A file of updates and further information is available
via anonymous FTP from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2. The PTB
project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection
of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed
in both Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2
& Treebank-3 both include the raw text for each story. Three "map" files are available
in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for
users who have licensed Treebank-2; they provide the relation between the 2,499 PTB
filenames and the corresponding WSJ DOCNO strings in TIPSTER.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell P.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Santorini, Beatrice
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcinkiewicz, Mary Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T7
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630497
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T8
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 177-142-310-728-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Japanese Business News Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T8
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Linguistic Data Consortium announces the availability of a Japanese language text
corpus composed of business and financial news from two sources:
* Approximately 30 million words of text have been made available from the morning
edition of Nihon Keizai Shimbun, the largest Japanese financial news daily newspaper;
the release this year covers all text that was published during 1994. The data was
received at the LDC on nine-track magnetic tape; the character encoding was EBCDIC,
but it was converted to EUC, which the LDC has chosen as its standard for Japanese.
* A smaller part of the corpus comes from Dow Jones Telerate, which markets its Japanese
Language Service. This is a financial newswire produced by Kyodo News Service; its
recipients are primarily managers of Japanese-owned corporations, or Japanese employees
working in North American brokerage houses, banking, etc. The text is received at the
LDC via a digital transmission service installed by Telerate; special software was
written by the LDC to poll a central database and download articles individually.
The character encoding is EUC.
This corpus is available to LDC members only.
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wu, Zhibiao
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T8
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC95T9
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 673-814-501-585-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC95T9
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Spanish News Corpus consists of journalistic text data from one newspaper (El
Norte, Mexico) and from the Spanish-language services of three newswire sources: Agence
France Presse, Associated Press Worldstream, and Reuters. (The Reuters collection
comprises two distinct services: Reuters Spanish Language News Service and Reuters
Latin American Business Report). All text data are stored in a standard compressed
form. The four sets of newswire data (AFP, APWS and two Reuters services) are each
organized as one data file per day of collection. The period covered by these collections
runs from December 1993 (for APWS and Reuters) or May 1994 (AFP) through December
1995. (The El Norte data, provided to us by INFOSEL Mexico, are arbitrarily grouped
into files of about 1 megabyte in size when uncompressed; date information is not
available for individual articles, but the general period of the collection is 1993).
The approximate amounts of data per source (when uncompressed) are indicated below
(in total megabytes and millions of words of text):
Source    MB   MW
AFP      345   44
APWS     253   33
REUSL    333   41
REULA    233   23
INFOSEL  209   31
The presentation of text data in these
collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging
is used (1) to mark article boundaries, (2) to delimit the text portion within each
article and (3) to label various pieces of information about the article that are
external to the text content (e.g. headlines, bylines and so on). The copyright holders
of this text have requested that it be made available to LDC members only. Due to
the release date, this corpus is available to 1995 and 1996 members. In order to obtain
this corpus, current LDC members must submit a signed User Agreement Form.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gallegos, Gustavo
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC95T9
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1995 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630853
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96L14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 204-698-863-053-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
dut
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
nld
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1995]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96L14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains ASCII versions of the CELEX lexical databases of English (Version
2.5), Dutch (Version 3.1) and German (Version 2.0). CELEX was developed as a joint
enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden,
the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for
Perception Research in Eindhoven. Pre-mastering and production were done by the LDC.
For each language, this data set contains detailed information on:
* orthography (variations in spelling, hyphenation)
* phonology (phonetic transcriptions, variations in pronunciation, syllable structure,
primary stress)
* morphology (derivational and compositional structure, inflectional paradigms)
* syntax (word class, word class-specific subcategorizations, argument structures)
* word frequency (summed word and lemma counts, based on recent and representative
text corpora)
The databases have not been tailored to fit any particular
database management program. Instead, the information is in ASCII files in a UNIX
directory tree that can be queried with tools such as AWK or Icon. Unique identity
numbers allow the linking of information from different files. Some kinds of information
have to be computed online; wherever necessary, AWK functions have been provided to
recover this information. README files specify the details of their use. A detailed
User Guide describing the various kinds of lexical information available is supplied.
All sections of this guide are PostScript files, except for some additional notes
on the German lexicon in plain ASCII. CELEX-2: The second release of CELEX contains
an enhanced, expanded version of the German lexical database (2.5), featuring approximately
1,000 new lemma entries, revised morphological parses, verb argument structures, inflectional
paradigm codes and a corpus type lexicon. A complete PostScript version of the Germanic
Linguistic Guide is also included, in both European A-4 format and American Letter
format. For German, the total number of lemmas included is now 51,728, while all their
inflected forms number 365,530. Moreover, phonetic syllable frequencies have been
added for (British) English and Dutch. Apart from this, and provision of frequency
information alongside every lexical feature, no changes have been made to the Dutch and
English lexicons. Complete AWK scripts are now provided to compute representations
not found in the (plain ASCII) lexical data files, corresponding to the features described
in the CELEX User Guide, which is included as well. For each language, i.e. English, German
and Dutch, the data contains detailed information on the orthography (variations in
spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation,
syllable structure, primary stress), the morphology (derivational and compositional
structure, inflectional paradigms), the syntax (word class, word-class specific subcategorisation,
argument structures) and word frequency (summed word and lemma counts, based on recent
and representative text corpora) of both wordforms and lemmas. Unique identity numbers
allow the linking of information from different files with the aid of an efficient,
index-based C-program. Like its predecessor, this release is mastered using the ISO
9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS,
Macintosh and UNIX environments. As the new release does not omit any data from the
first edition, the current release will replace the old one.
LANGUAGE NOTE
- Language note:
Content in English, German, and Dutch. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Baayen, R.H.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Piepenbrock, R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gulikers, L.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96L14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630799
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96L15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 969-490-893-990-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Mandarin Chinese Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96L15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Mandarin Chinese collection includes a lexical component. The CALLHOME
Mandarin Lexicon consists of 44,405 words and contains separate information fields
with phonological, morphological and frequency information for each word. The token
coverage by the LDC Mandarin lexicon of words occurring in the 20 LDC Mandarin CALLHOME
devtest transcripts (ten minutes of conversation each) is 98%. Orthographic Chinese
characters are GB-encoded and are simplified in the Mainland style. A representation
of the headword in tone pinyin with strictly lexical tone (i.e. not reflecting phonetic/phonological
processes) is also provided. Here is a sample page from the lexicon. The transcripts
and documentation (LDC96T16) are available separately, as is a corpus of telephone
speech (LDC96S34).
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bian, Xuejun
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96L15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630829
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96L16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 411-575-699-412-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Spanish Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96L16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Spanish collection includes a lexical component. The CALLHOME Spanish
Lexicon consists of 45,582 words and contains separate information fields with phonological,
morphological and frequency information for each word. The token coverage by the LDC
Spanish lexicon of words occurring in the 20 LDC Spanish CALLHOME devtest transcripts
(ten minutes of conversation each) is 98.7%. For examples of listings from the Lexicon,
please look at the following sample pages: sample1, sample2. The transcripts and documentation
(LDC96T17) are available separately, as is a corpus of telephone speech (LDC96S35).
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garrett, Susan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morton, Tom
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96L16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630764
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96L17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 268-800-139-007-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Japanese Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96L17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Japanese collection includes a lexical component. The CALLHOME Japanese
Lexicon consists of 80,688 words and contains separate information fields with morphological,
phonological and stress information for each word. The lexicon is distributed via
FTP. A verbal analyzer/synthesizer (transducer) is also included. Here is a sample
page from the lexicon. The transcripts and documentation (LDC96T18) are available
separately, as is a corpus of telephone speech (LDC96S37).
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
- Geographic subdivision:
North America
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kobayashi, Megumi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Crist, Sean
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaneko, Masayo
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96L17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630918
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S29
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 311-510-096-477-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Frontiers in Speech Processing 93
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S29
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This CD reflects the cooperative efforts of 28 researchers who attended the 1993 summer
workshop in speech processing hosted by the Center for Computer Aids for Industrial
Productivity (CAIP) at Rutgers University and sponsored by the National Security Agency.
The workshop was an outgrowth of summers at the Center for Communication Research
in Princeton (CCR-P) and targeted problems concerning general-purpose speech recognition
with particular emphasis on front end processing. The project was held from July 6th
to August 13th and utilized extensive computational resources: both equipment native
to CAIP and additional hardware acquired for the workshop.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S29
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630888
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 485-058-109-556-2
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CTIMIT is a cellular-bandwidth adjunct to TIMIT Acoustic Phonetic Continuous Speech
(LDC93S1). CTIMIT has been designed to provide a large phonetically-labeled database
for use in the design and evaluation of speech processing systems operating in diverse,
often hostile, cellular telephone environments. CTIMIT was collected by members of
the Voice Communication Initiative (VCI) at Lockheed-Martin Sanders' Signal Processing
Center of Technology (SPCOT) as part of internal R&D efforts, with additional sponsorship
from the Wireless Communications Group in the company's Advanced Engineering and Technology
(AE&T) Division. Like NTIMIT (LDC93S2), CTIMIT is based on the original TIMIT recordings,
which were passed through a sample of actual telephone circuits -- cellular circuits
in the case of CTIMIT. Thus, the original phonetic segmentation and labeling of TIMIT
continue to be applicable to CTIMIT as well as NTIMIT.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
George, E. Bryan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brown, Kathy L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Birnbaum, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Macon, Michael
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S31
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 440-074-007-959-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S31
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains all of the speech data provided to sites participating in the
DARPA CSR November 1995 HUB4 (Radio) Broadcast News tests. The data consists of digitized
waveforms of MarketPlace (tm) business news radio shows provided by KUSC through an
agreement with the Linguistic Data Consortium and detailed transcriptions of those
broadcasts. The software NIST used to process and score the output of the test systems
is also included. The data is organized as follows:
CD26-1: Training Data. Ten complete half-hour broadcasts with minimal-verified transcripts.
The transcripts are time aligned with the waveforms at the story-boundary level.
CD26-2: Development-Test Data. Six complete half-hour broadcasts with verified transcripts.
The transcripts are time aligned with the waveforms at the story- and turn-boundary
level. Index files have been included which specify how the data may be partitioned
into 2 test sets.
CD26-6: Evaluation-Test Data. Five complete half-hour broadcasts with verified/adjudicated
transcripts. The transcripts are time aligned with the waveforms at the story-, turn-
and music-boundary level. An index file has been included which specifies how the
data was partitioned into the test set used in the CSR 1995 HUB4 tests.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S31
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S32
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 671-675-113-200-7
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S32
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
FFMTIMIT contains the previously unreleased secondary microphone waveforms for TIMIT
Acoustic-Phonetic Continuous Speech. The primary microphone waveforms, which were
recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone
(model HMD-414), are available from LDC on NIST Speech Disc 1-1.1 (LDC93S1). The secondary
microphone used in the recording of the TIMIT corpus was a Brüel & Kjær (B&K) 1/2"
free-field microphone (model 4165). While the Sennheiser microphone recordings are
relatively "clean" with respect to non-speech noise, the FFMTIMIT recordings include
significant low frequency noise, which was due to the HVAC system and mechanical vibration
transmitted through the floor of the double-walled sound booth used in recording.
Because it is noisier than its TIMIT counterpart, the data of FFMTIMIT may be used
in the development of more noise-robust speech recognition systems. In addition, this
data may be of value to researchers involved in vocal tract modeling because the B&K
microphone has extremely flat free-field frequency response and calibration tones
are provided. Note that the B&K TIMIT data contained in this release has not been
processed through any highpass filter (e.g., the 1,581-point filter described in
the paper "The DARPA Speech Recognition Research Database" by Fisher, Doddington and
Goudie-Marshall in "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM,"
NISTIR 4930 / NTIS Order No. PB93-173938).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lamel, Lori F.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zue, Victor
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S32
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630861
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S33
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 529-082-231-699-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S33
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This set of CD-ROMs contains all of the speech data provided to sites participating
in the DARPA CSR November 1995 HUB3 Multi-Microphone tests. The data consists of digitized
waveforms collected with eight different microphones simultaneously from 40 subjects
reading 15-sentence articles drawn from various North American business news publications.
The data is partitioned into development-test and evaluation-test sets. The test sets
were collected with different subjects, prompts and microphones. No training data
was collected for this corpus since a substantial amount of NAB acoustic training
data was already available. Index files have been included that specify the exact
subset of the evaluation test recordings which were used in the November 1995 tests.
The software NIST used to process and score the output of the test systems is also
included. The data is organized as follows:
CD26-3: Development-Test Data, Location 1, Adaptation and NAB recordings, Subjects: 703-705, 707-70a, 70c, 70f, 70g
CD26-4: Development-Test Data, Location 2, NAB recordings, Subjects: 70k, 70m, 70o, 70q-70s, 70u-70w
CD26-5: Development-Test Data, Location 2, Adaptation recordings, Subjects: 70k, 70m-70o, 70q-70s, 70u-70w
CD26-3: Development-Test Data, NAB recordings, Subjects: 710-71j
As of September 2007, this publication has been condensed to fit on a single DVD.
The data on each CD resides in its own directory labeled with the above NIST labels.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S33
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630802
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S34
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 969-755-457-598-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Mandarin Chinese Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S34
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Mandarin Chinese corpus of telephone speech consists of 120 unscripted
telephone conversations between native speakers of Mandarin Chinese. All calls, which
lasted up to 30 minutes, originated in North America and were placed to locations
overseas. Most participants called family members or close friends. *Data* This corpus
contains speech data files only, along with documentation that describes the contents
and format of the speech files and the software packages needed to uncompress the
speech data. The transcripts and documentation (LDC96T16) are available separately,
as is an associated lexicon (LDC96L15).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S34
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630837
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S35
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 321-477-528-167-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Spanish Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S35
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Spanish corpus of telephone speech consists of 120 unscripted telephone
conversations between native speakers of Spanish. All calls, which lasted up to 30
minutes, originated in North America and were placed to international locations. Most
participants called family members or close friends. This corpus contains speech data
files ONLY, along with the minimal amount of documentation needed to describe the
contents and format of the speech files and the software packages needed to uncompress
the speech data. The transcripts and documentation (LDC96T17) are available separately,
as is an associated lexicon (LDC96L16).
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S35
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630608
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S36
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 601-939-678-076-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Boston University Radio Speech Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S36
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Boston University Radio Speech Corpus was collected primarily to support research
in text-to-speech synthesis, particularly generation of prosodic patterns. The corpus
consists of professionally read radio news data, including speech and accompanying
annotations, suitable for speech and language research. The corpus includes speech
from seven (four male, three female) FM radio news announcers associated with WBUR,
a public radio station. The main radio news portion of the corpus consists of over
seven hours of news stories recorded in the WBUR radio studio during broadcasts over
a two-year period. In addition, the announcers were recorded in a laboratory
at Boston University. In this, the lab news portion, the announcers read a total of
24 stories from the radio news portion. The announcers were first asked to read the
stories in their non-radio style and then, 30 minutes later, to read the same stories
in their radio style. Each story read by an announcer was digitized in paragraph-sized
units, which typically include several sentences. The files were digitized at a 16 kHz
sample rate using a 16-bit A/D converter. The paragraphs were annotated with the orthographic
transcription, phonetic alignments, part-of-speech tags and prosodic markers. The
orthographic transcripts were generated by hand and include indication of where the
speaker took a breath. The phonetic alignments and part-of-speech tags were generated
automatically and hand corrected. The prosodic labels were marked by hand and are
available only for a subset of the corpus. A zip-compressed sample file, example.zip, is
available. Please be aware that this file is slightly larger than 1 MB (1,278,998
bytes). An additional sample file, LDC1996.tgz, and a WAV sample are also available.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ostendorf, Mari
ADDED ENTRY--PERSONAL NAME
- Personal name:
Price, Patti
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shattuck-Hufnagel, Stefanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S36
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630772
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S37
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 365-589-437-035-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Japanese Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S37
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Japanese corpus of telephone speech consists of 120 unscripted telephone
conversations between native speakers of Japanese. All calls, which lasted up to 30
minutes, originated in North America and were placed to locations overseas (typically
Japan). Most participants called family members or close friends. *Data* This corpus
contains speech data files ONLY, along with the minimal amount of documentation needed
to describe the contents and format of the speech files and the software packages
needed to uncompress the speech data. The transcripts and documentation (LDC96T18)
are available separately, as is an associated lexicon and transducer (LDC96L17).
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Japanese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S37
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630896
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S38
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 139-466-600-760-1
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S38
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains the materials used to collect all 216 spoken dialogues: digital
audio, orthographic transcriptions, documentation and source code for tools. The dialogues
were selected to provide balanced representation at different points in a sleep deprivation
experiment. *Data* The materials have been designed to be easily accessible to users
with different equipment and a variety of needs, from those who merely wish to generate
hardcopies of the orthographic transcriptions to those who require computational analyses
of the speech material. All the text files (transcriptions and documentation) should
be readable and printable via most systems. The maps are intended for printing via
PostScript printers, and the speech files are provided with human-readable standard
headers, enabling them to be played in a wide range of environments for processing
sampled speech. *Samples* A speech sample and a transcript sample are available.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Martin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bard, Ellen Gurman
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sotillo, Cathy
ADDED ENTRY--PERSONAL NAME
- Personal name:
McKelvie, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Anderson, Anne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S38
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S39
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 819-670-687-754-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RM Isolated and Spelled Word Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S39
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains previously unreleased isolated-word and spell-mode (spelled
out words) speech data from the (D)ARPA Resource Management (RM1) Corpus. This data
is based on a 600-word subset of the 991-word RM1 vocabulary and contains spoken and
spelled words pertaining to the RM1 naval resource management task. This corpus was
collected simultaneously as part of the RM1 Continuous Speech Corpus (NIST Speech
Discs 2-1 through 2-4) and contains speech from the same sets of subjects used in RM1. *Data*
The speech data has been segmented into separate spelled and spoken-word waveform
files for each subject-word utterance. Time-aligned word and phonetic transcriptions
have been generated automatically using forced recognition and are included. The time-aligned
transcriptions employ the same format and phone set as the TIMIT Acoustic-Phonetic
Continuous Speech Corpus (NIST Speech Disc 1-1). See the TIMIT CD-ROM companion booklet,
NISTIR 4930, pp. 29-31, for a description of the phone set. As with the continuous
speech portion of RM1, this data is subsetted into speaker-independent and speaker-dependent
partitions. These data sets are further partitioned into training, development-test
and evaluation-test subsets. See the "readme.doc" file in the top-level directory
for more information about the data. Texas Instruments recruited the subjects and
collected the speech. The National Institute of Standards and Technology (NIST) segmented
the waveforms, generated the time-aligned transcriptions and produced this release.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S39
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630926
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S40
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 493-178-172-435-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Frontiers in Speech Processing 94
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S40
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This CD reflects the cooperative efforts of 28 researchers who attended the 1994 summer
workshop in speech processing hosted by the Center for Computer Aids for Industrial
Productivity (CAIP) at Rutgers University and sponsored by the National Security Agency.
The workshop was an outgrowth of summers at the Center for Communication Research
in Princeton (CCR-P) and targeted problems concerning general-purpose speech recognition
with particular emphasis on front end processing. The project was held from July 6th
to August 13th and utilized extensive computational resources: both equipment native
to CAIP and additional hardware acquired for the workshop.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Flanagan, James
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S40
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631078
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S41
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 759-953-194-215-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
VAHA (POLYPHONE II)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S41
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Voice Across Hispanic America (VAHA) is a corpus of Spanish telephone speech, recorded
digitally from 915 native speakers of Spanish in various parts of the United States.
With nearly 39,000 recorded and transcribed utterances, VAHA will be useful for a
variety of research studies, but it is intended primarily for speech technology research
and development in telecommunications applications. It is patterned after Macrophone
(1), an American English corpus (LDC94S21) which is widely used for this purpose.
*Data* This corpus was collected by Texas Instruments in Dallas, TX for the Linguistic
Data Consortium.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant K.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S41
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630276
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC94S14D
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 764-172-920-528-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Air Traffic Control DFW
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC94S14D
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for
use in supporting research and development activities in the area of robust speech
recognition in domains similar to air traffic control (several speakers, noisy channels,
relatively small vocabulary, constrained language, etc.). The audio data on these
discs is composed of voice communication traffic between various controllers and pilots.
*Data* The audio files are 8 kHz, 16-bit linear sampled data, representing continuous
monitoring, without squelch or silence elimination, of a single FAA frequency for
one to two hours. There are also files which indicate the amplitude of the received
AM carrier signal at 10 ms intervals. Full transcripts, including the start and
end times of each transmission, are provided for each audio file. Each flight is identified
by its flight number. ATC0 consists of three subcorpora, one for each airport in which
the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS)
and Washington National (DCA). The complete set contains approximately 70 hours of
controller and pilot transmissions collected via antennas and radio receivers which
were located in the vicinity of the respective airports. Detailed information regarding
the collection process and the equipment used can be found on each disc in the file
"atc.doc" in the "doc" directory. The ATC0 Corpus was collected by Texas Instruments
under contract to DARPA. It was produced on CD-ROM by the National Institute of Standards
and Technology for distribution by the Linguistic Data Consortium.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC94S14D
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630616
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S46
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 491-813-496-892-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND American English-Non-Southern Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S46
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of non-Southern dialects of American English. All calls are domestic and
were placed inside the continental United States, Canada, Puerto Rico, or the Dominican
Republic. Callers in the "non-Southern" (or "general") collection of CALLFRIEND American
English appear to come from a wide geographic range, based on their own reports of
where they were raised (some identified their origins as being in the southeastern
U.S.). Regardless of their geographic or ethnic backgrounds, the feature they share
is the clear absence of a vowel quality pattern that would distinguish them as speakers
of a "Southern" dialect. Some information was inadvertently left out of the speaker
information table and the call information table. Copies of these files are available
as CALLINFO.TBL and SPKRINFO.TBL.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S46
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630624
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S47
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 405-742-783-280-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND American English-Southern Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S47
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Southern American English. All calls are domestic and were placed inside
the continental United States, Canada, Puerto Rico or the Dominican Republic. Callers
in the "Southern" collection of CALLFRIEND American English were identified primarily
on the basis of vowel quality patterns that are common among native speakers raised
in the southeastern United States (from Texas eastward to the Atlantic coast and from
Virginia and Kentucky southward to the Gulf of Mexico). This category also includes
a small number of African-American speakers, whose geographic origins may be more
dispersed, but who share some of the vowel quality patterns distinctive of Southern
white speakers. (Of course, other dialect features involving phonology, syntax and
prosody serve to differentiate these two subgroups within the "Southern" collection.)
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S47
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630632
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S48
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 127-055-636-719-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Canadian French
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S48
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
Canadian speakers of French. All calls are domestic and were placed inside the continental
United States and Canada.
LANGUAGE NOTE
- Language note:
Content in French. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S48
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630640
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S49
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 635-815-047-587-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Egyptian Arabic
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S49
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of the Egyptian dialect of Arabic. All calls are domestic and were placed
inside the continental United States and Canada.
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S49
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u per d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630659
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S50
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 658-073-786-076-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
pes
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Farsi
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S50
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Farsi. All calls are domestic and were placed inside the continental United
States and Canada.
LANGUAGE NOTE
- Language note:
Content in Iranian Persian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S50
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630667
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S51
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 559-775-855-428-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND German
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S51
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of German. All calls are domestic and were placed inside the continental
United States and Canada.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S51
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u hin d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630675
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S52
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 651-156-740-657-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Hindi
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S52
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Hindi. All calls are domestic and were placed inside the continental United
States and Canada.
LANGUAGE NOTE
- Language note:
Content in Hindi. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S52
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630683
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S53
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 016-611-126-563-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Japanese
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S53
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Japanese. All calls are domestic and were placed inside the continental
United States and Canada.
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Japanese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S53
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630691
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S54
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 501-796-595-536-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Korean
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S54
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project was designed to support the development of language identification
technology. *Data* The corpus consists of 60 telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Korean. All calls are domestic and were placed inside the continental
United States and Canada.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S54
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630705
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S55
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 608-636-717-091-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Mandarin Chinese-Mainland Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S55
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supported the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Mandarin Chinese from Mainland China. All calls are domestic and were
placed inside the continental United States and Canada. Callers in the "Mainland"
and "Taiwan" collections of CALLFRIEND Mandarin were identified primarily on the basis
of specific attributes in their speech characteristic of geographic origin.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S55
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630713
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S56
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 804-964-527-007-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Mandarin Chinese-Taiwan Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S56
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Mandarin Chinese from Taiwan. All calls are domestic and were placed inside
the continental United States and Canada. Callers in the "Mainland" and "Taiwan" collections
of CALLFRIEND Mandarin were identified primarily on the basis of specific attributes
in their speech characteristic of geographic origin.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S56
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630721
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S57
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 060-146-980-446-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Spanish-Caribbean Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S57
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Spanish from Caribbean countries. All calls are domestic and were placed
inside the continental United States, Canada, Puerto Rico, or the Dominican Republic.
Conversations were labeled as either "Caribbean" or "non-Caribbean" based on particular
attributes in the speech of the participants. Callers in the "Caribbean" and "non-Caribbean"
collections of CALLFRIEND Spanish were identified primarily on the basis of consonant
quality patterns, specifically, word-final "s."
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S57
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S58
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 493-342-336-930-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Spanish-Non-Caribbean Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S58
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Spanish from non-Caribbean countries. All calls are domestic and were
placed inside the continental United States, Canada, Puerto Rico, or the Dominican
Republic. Conversations were labeled as either "Caribbean" or "non-Caribbean" based
on particular attributes in the speech of the participants. Callers in the "Caribbean"
and "non-Caribbean" collections of CALLFRIEND Spanish were identified primarily on
the basis of consonant quality patterns, specifically, word-final "s."
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S58
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u tam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630748
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S59
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 954-950-050-409-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Tamil
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S59
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Tamil. All calls are domestic and were placed inside the continental United
States and Canada.
LANGUAGE NOTE
- Language note:
Content in Tamil. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S59
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u vie d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630756
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S60
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 692-966-307-809-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Vietnamese
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S60
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLFRIEND project supports the development of language identification technology.
*Data* The corpus consists of 60 unscripted telephone conversations, lasting between
5 and 30 minutes. The corpus also includes documentation describing speaker information
(sex, age, education, callee telephone number) and call information (channel quality,
number of speakers). For each conversation, both the caller and callee are native
speakers of Vietnamese. All calls are domestic and were placed inside the continental
United States and Canada.
LANGUAGE NOTE
- Language note:
Content in Vietnamese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Vietnamese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Vietnamese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Vietnamese
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S60
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630594
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S61
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 678-642-946-413-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1996 Speaker Recognition Benchmark
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S61
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus, which is a subset of the Switchboard-1 (LDC93S7) corpus, was used in
NIST's 1996 Speaker Recognition Evaluation. The focus of this evaluation was on detection
of the presence of a hypothesized target speaker, given a segment of conversational
speech over the telephone. *Data* The corpus consists of one Development Data disc
and two Evaluation Data discs. Both sets include training and test segments. The Development
Data includes both training and test segments for about 45 male and 45 female speakers.
The training data consists of about four one-minute segments of speech data for each
target speaker. The test data contains shorter segments of speech data (three, 10,
and 30 seconds) that were taken from different conversations for each speaker. The
Evaluation Data includes about 20 male and 20 female target speakers and 200 male
and 200 female non-target speakers. All of these speakers are different from the speakers
in the Development Data set. Training data is supplied for each of the target speakers,
in the same manner as the Development Data. Test data is supplied for both the target
and the non-target speakers, in the same manner as the Development Data.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S61
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630934
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 734-085-058-125-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C microphone. Channel 1
(LDC96S65), available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
  number of speakers: 150 speakers
    males: 75
    females: 75
  range of speaker age: 10 yrs. to 70 yrs.
  number of items per speaker: 323 items
    isolated digits: 15
    four digit sequences: 35
    city names: 100
    monosyllables: 110
    control words (set A): 13
    control words (set B): 24
    control words (set C): 26
  number of repetitions per item: 4 repetitions
  total number of utterances: 193,763 utterances (per channel)
  sample frequency: 16 kHz
  sample type: 16-bit linear
  number of microphones: 2 (dynamic and condenser)
This corpus is now available as a web download; however, for purposes of publication
by the LDC, the corpus was organized onto 40 CD-ROMs. The data files were partitioned
primarily by channel (20 CD-ROMs each for channel 0 and channel 1) and secondarily
by category of prompts. These prompts include:
  Description                   Number of items
  Control Words:
    Banking Services            13
    Word Processors             24
    Home Electronic Equipment   26
  Digits:
    Isolated Digits             15
    Four Digit Sequences        35
  City Names:                   100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables:                110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   5       JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    1       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    1       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    1       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    1       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630942
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64-1
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 901-478-179-194-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 City Names
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64-1
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C microphone. Channel 1
(LDC96S65), available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
  number of speakers: 150 speakers
    males: 75
    females: 75
  range of speaker age: 10 yrs. to 70 yrs.
  number of items per speaker: 323 items
    isolated digits: 15
    four digit sequences: 35
    city names: 100
    monosyllables: 110
    control words (set A): 13
    control words (set B): 24
    control words (set C): 26
  number of repetitions per item: 4 repetitions
  total number of utterances: 193,763 utterances (per channel)
  sample frequency: 16 kHz
  sample type: 16-bit linear
  number of microphones: 2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs.
The data files have been partitioned primarily by channel (20 CD-ROMs each for channel
0 and channel 1) and secondarily by category of prompts. These prompts include:
  Description                   Number of items
  Control Words:
    Banking Services            13
    Word Processors             24
    Home Electronic Equipment   26
  Digits:
    Isolated Digits             15
    Four Digit Sequences        35
  City Names:                   100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables:                110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   5       JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    1       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    1       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    1       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    1       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64-1
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630950
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64-2
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 223-532-296-030-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 Control Words
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64-2
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C microphone. Channel 1
(LDC96S65), available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
  number of speakers: 150 speakers
    males: 75
    females: 75
  range of speaker age: 10 yrs. to 70 yrs.
  number of items per speaker: 323 items
    isolated digits: 15
    four digit sequences: 35
    city names: 100
    monosyllables: 110
    control words (set A): 13
    control words (set B): 24
    control words (set C): 26
  number of repetitions per item: 4 repetitions
  total number of utterances: 193,763 utterances (per channel)
  sample frequency: 16 kHz
  sample type: 16-bit linear
  number of microphones: 2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs.
The data files have been partitioned primarily by channel (20 CD-ROMs each for channel
0 and channel 1) and secondarily by category of prompts. These prompts include:
  Description                   Number of items
  Control Words:
    Banking Services            13
    Word Processors             24
    Home Electronic Equipment   26
  Digits:
    Isolated Digits             15
    Four Digit Sequences        35
  City Names:                   100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables:                110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   5       JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    1       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    1       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    1       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    1       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64-2
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630969
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64-3
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 131-941-020-714-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 Isolated Digits
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64-3
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japanese Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C microphone. Channel 1
(LDC96S65), available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site.
A summary of the size and content of the corpus is given below:
  number of speakers: 150 speakers
    males: 75
    females: 75
  range of speaker age: 10 yrs. to 70 yrs.
  number of items per speaker: 323 items
    isolated digits: 15
    four digit sequences: 35
    city names: 100
    monosyllables: 110
    control words (set A): 13
    control words (set B): 24
    control words (set C): 26
  number of repetitions per item: 4 repetitions
  total number of utterances: 193,763 utterances (per channel)
  sample frequency: 16 kHz
  sample type: 16-bit linear
  number of microphones: 2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs.
The data files have been partitioned primarily by channel (20 CD-ROMs each for channel
0 and channel 1) and secondarily by category of prompts. These prompts include:
  Description                   Number of items
  Control Words:
    Banking Services            13
    Word Processors             24
    Home Electronic Equipment   26
  Digits:
    Isolated Digits             15
    Four Digit Sequences        35
  City Names:                   100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables:                110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   5       JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    1       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    1       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    1       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    1       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64-3
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630977
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64-4
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 091-972-696-859-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 Four Digit Sequences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64-4
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65),
available separately, contains data recorded simultaneously with a condenser microphone
that presumably varied from site to site. A summary of the size and content of the
corpus is given below:
  number of speakers               150 speakers
    males                          75
    females                        75
  range of speaker age             10 yrs. to 70 yrs.
  number of items per speaker      323 items
    isolated digits                15
    four digit sequences           35
    city names                     100
    monosyllables                  110
    control words (set A)          13
    control words (set B)          24
    control words (set C)          26
  number of repetitions per item   4 repetitions
  total number of utterances       193,763 utterances (per channel)
  sample frequency                 16 kHz
  sample type                      16-bit linear
  number of microphones            2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                      Number of items
  Control Words:
    Banking Services               13
    Word Processors                24
    Home Electronic Equipment      26
  Digits:
    Isolated Digits                15
    Four Digit Sequences           35
  City Names:                      100 (a phonetically rich subset of common Japanese city names)
  Monosyllables:                   110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                           Catalog ID
  2000   5       JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
  600    1       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
  400    1       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
  300    1       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
  600    1       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64-4
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630985
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S64-5
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 298-496-720-875-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 0 Mono Syllables
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S64-5
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65),
available separately, contains data recorded simultaneously with a condenser microphone
that presumably varied from site to site. A summary of the size and content of the
corpus is given below:
  number of speakers               150 speakers
    males                          75
    females                        75
  range of speaker age             10 yrs. to 70 yrs.
  number of items per speaker      323 items
    isolated digits                15
    four digit sequences           35
    city names                     100
    monosyllables                  110
    control words (set A)          13
    control words (set B)          24
    control words (set C)          26
  number of repetitions per item   4 repetitions
  total number of utterances       193,763 utterances (per channel)
  sample frequency                 16 kHz
  sample type                      16-bit linear
  number of microphones            2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                      Number of items
  Control Words:
    Banking Services               13
    Word Processors                24
    Home Electronic Equipment      26
  Digits:
    Isolated Digits                15
    Four Digit Sequences           35
  City Names:                      100 (a phonetically rich subset of common Japanese city names)
  Monosyllables:                   110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                           Catalog ID
  2000   5       JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
  600    1       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
  400    1       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
  300    1       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
  600    1       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S64-5
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630993
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 324-094-816-326-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65),
available separately, contains data recorded simultaneously with a condenser microphone
that presumably varied from site to site. A summary of the size and content of the
corpus is given below:
  number of speakers               150 speakers
    males                          75
    females                        75
  range of speaker age             10 yrs. to 70 yrs.
  number of items per speaker      323 items
    isolated digits                15
    four digit sequences           35
    city names                     100
    monosyllables                  110
    control words (set A)          13
    control words (set B)          24
    control words (set C)          26
  number of repetitions per item   4 repetitions
  total number of utterances       193,763 utterances (per channel)
  sample frequency                 16 kHz
  sample type                      16-bit linear
  number of microphones            2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                      Number of items
  Control Words:
    Banking Services               13
    Word Processors                24
    Home Electronic Equipment      26
  Digits:
    Isolated Digits                15
    Four Digit Sequences           35
  City Names:                      100 (a phonetically rich subset of common Japanese city names)
  Monosyllables:                   110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                           Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636401
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 112-444-010-598-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets was developed
by NIST Multimodal Information Group. This release contains the evaluation sets (source
data and human reference translations), DTD, scoring software, and evaluation plans
for the Arabic-to-English and Chinese-to-English progress test sets for the NIST OpenMT
2008, 2009, and 2012 evaluations. The test data remained unseen between evaluations
and was reused unchanged each time. The package was compiled, and scoring software
was developed, at NIST, making use of Chinese and Arabic newswire and web data and
reference translations collected and developed by the Linguistic Data Consortium (LDC).
The objective of the OpenMT evaluation series is to support research in, and help
advance the state of the art of, machine translation (MT) technologies -- technologies
that translate text between human languages. Input may include all forms of text.
The goal is for the output to be an adequate and fluent translation of the original.
The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual
Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the
evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations
provide an important contribution to the direction of research efforts and the calibration
of technical capabilities in MT. The OpenMT evaluations are intended to be of interest
to all researchers working on the general problem of automatic translation between
human languages. To this end, they are designed to be simple, to focus on core technology
issues and to be fully supported. For more general information about the NIST OpenMT
evaluations, please refer to the NIST OpenMT website. This evaluation kit includes
a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality
score for one (or more) MT systems. The script works by comparing the system output
translation with a set of (expert) reference translations of the same source text.
Comparison is based on finding sequences of words in the reference translations that
match word sequences in the system output translation. LDC has released the following
associated corpora:
  * NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21)
  * NIST 2009 Open Machine Translation (OpenMT) Evaluation (LDC2010T23)
  * NIST 2012 Open Machine Translation (OpenMT) Evaluation (LDC2013T03)
*Data* This release contains 2,748 documents with corresponding source and reference
files, the latter of which contain four independent human reference translations of
the source data. The source data consists of Arabic and Chinese newswire and web data
collected by LDC in 2007. The table below displays statistics by source, genre,
documents, segments and source tokens.
  Source   Genre     Documents  Segments  Source Tokens
  Arabic   Newswire  84         784       20039
  Arabic   Web Data  51         594       14793
  Chinese  Newswire  82         688       26923
  Chinese  Web Data  40         682       19112
The token counts for Chinese data are character counts, which were obtained by counting
tokens matching the Unicode-based regular expression \w. The Python re module was used
to obtain those counts. The data in this package are in XML format compliant with the
included DTD.
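The character counting described in the summary above can be illustrated with a minimal sketch. It assumes the pattern is the word-character class \w and relies on Python 3, where string patterns are Unicode-aware by default. The function name and the sample string are hypothetical, and this is not the script NIST used.

```python
import re

# Assumption: the counting pattern is \w. For str patterns in Python 3,
# \w matches Unicode word characters, including CJK ideographs.
WORD_CHAR = re.compile(r"\w")


def count_word_chars(text: str) -> int:
    """Return the number of characters in text that match the word-character class."""
    return len(WORD_CHAR.findall(text))


if __name__ == "__main__":
    # Hypothetical sample: four ideographs, two Latin letters, four digits,
    # plus spaces and an ideographic full stop, which are not counted.
    sample = "机器翻译 MT 2012。"
    print(count_word_chars(sample))
```

Counting this way skips whitespace and punctuation, so the totals reflect written characters rather than bytes or space-delimited words.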
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631000
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65-1
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 176-797-886-186-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 City Names
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65-1
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone, a Sanken MU-2C. Channel 1 (LDC96S65),
available separately, contains data recorded simultaneously with a condenser microphone
that presumably varied from site to site. A summary of the size and content of the
corpus is given below:
  number of speakers               150 speakers
    males                          75
    females                        75
  range of speaker age             10 yrs. to 70 yrs.
  number of items per speaker      323 items
    isolated digits                15
    four digit sequences           35
    city names                     100
    monosyllables                  110
    control words (set A)          13
    control words (set B)          24
    control words (set C)          26
  number of repetitions per item   4 repetitions
  total number of utterances       193,763 utterances (per channel)
  sample frequency                 16 kHz
  sample type                      16-bit linear
  number of microphones            2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                      Number of items
  Control Words:
    Banking Services               13
    Word Processors                24
    Home Electronic Equipment      26
  Digits:
    Isolated Digits                15
    Four Digit Sequences           35
  City Names:                      100 (a phonetically rich subset of common Japanese city names)
  Monosyllables:                   110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                           Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)       LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names       LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words    LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits  LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.  LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables    LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)       LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names       LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words    LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits  LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.  LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables    LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65-1
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631019
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65-2
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 091-625-046-104-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 Control Words
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65-2
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone (a Sanken MU-2C). Channel 1 (LDC96S65),
which is available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site. A summary of the size and content
of the corpus is given below:
  number of speakers                150 speakers
    males                           75
    females                         75
  range of speaker age              10 yrs. to 70 yrs.
  number of items per speaker       323 items
    isolated digits                 15
    four digit sequences            35
    city names                      100
    monosyllables                   110
    control words (set A)           13
    control words (set B)           24
    control words (set C)           26
  number of repetitions per item    4 repetitions
  total number of utterances        193,763 utterances (per channel)
  sample frequency                  16 kHz
  sample type                       16-bit linear
  number of microphones             2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                       Number of items
  Control Words:
    Banking Services                13
    Word Processors                 24
    Home Electronic Equipment       26
  Digits:
    Isolated Digits                 15
    Four Digit Sequences            35
  City Names                        100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables                     110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
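As a quick consistency check on the figures in this summary, the following Python sketch sums the per-category item counts and computes the nominal utterance count per channel and the raw audio data rate. The variable names are illustrative and not part of the corpus documentation; the record does not explain why the stated total (193,763) is slightly lower than the nominal product.

```python
# Figures copied from the summary above; names are hypothetical.
ITEMS_PER_SPEAKER = {
    "isolated digits": 15,
    "four digit sequences": 35,
    "city names": 100,
    "monosyllables": 110,
    "control words (set A)": 13,
    "control words (set B)": 24,
    "control words (set C)": 26,
}
SPEAKERS = 150
REPETITIONS = 4
STATED_UTTERANCES = 193_763  # per channel, as stated in the record

SAMPLE_RATE_HZ = 16_000      # "sample frequency 16 kHz"
BYTES_PER_SAMPLE = 2         # "16-bit linear"

# Items per speaker should match the stated 323.
items = sum(ITEMS_PER_SPEAKER.values())

# Nominal count if every speaker produced every repetition of every item.
nominal_utterances = SPEAKERS * items * REPETITIONS

# Difference between the nominal count and the stated total.
shortfall = nominal_utterances - STATED_UTTERANCES

# Raw single-channel data rate, ignoring any file headers.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE

print(items, nominal_utterances, shortfall, bytes_per_second)
```

The category counts add up to the stated 323 items per speaker, while the nominal product of speakers, items and repetitions comes out somewhat above the stated utterance total.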
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65-2
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631027
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65-3
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 034-730-028-932-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 Isolated Digits
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65-3
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone (a Sanken MU-2C). Channel 1 (LDC96S65),
which is available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site. A summary of the size and content
of the corpus is given below:
  number of speakers                150 speakers
    males                           75
    females                         75
  range of speaker age              10 yrs. to 70 yrs.
  number of items per speaker       323 items
    isolated digits                 15
    four digit sequences            35
    city names                      100
    monosyllables                   110
    control words (set A)           13
    control words (set B)           24
    control words (set C)           26
  number of repetitions per item    4 repetitions
  total number of utterances        193,763 utterances (per channel)
  sample frequency                  16 kHz
  sample type                       16-bit linear
  number of microphones             2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                       Number of items
  Control Words:
    Banking Services                13
    Word Processors                 24
    Home Electronic Equipment       26
  Digits:
    Isolated Digits                 15
    Four Digit Sequences            35
  City Names                        100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables                     110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65-3
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631035
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65-4
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 150-660-810-991-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 Four Digit Sequences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65-4
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone (a Sanken MU-2C). Channel 1 (LDC96S65),
which is available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site. A summary of the size and content
of the corpus is given below:
  number of speakers                150 speakers
    males                           75
    females                         75
  range of speaker age              10 yrs. to 70 yrs.
  number of items per speaker       323 items
    isolated digits                 15
    four digit sequences            35
    city names                      100
    monosyllables                   110
    control words (set A)           13
    control words (set B)           24
    control words (set C)           26
  number of repetitions per item    4 repetitions
  total number of utterances        193,763 utterances (per channel)
  sample frequency                  16 kHz
  sample type                       16-bit linear
  number of microphones             2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                       Number of items
  Control Words:
    Banking Services                13
    Word Processors                 24
    Home Electronic Equipment       26
  Digits:
    Isolated Digits                 15
    Four Digit Sequences            35
  City Names                        100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables                     110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65-4
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631043
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96S65-5
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 071-929-364-754-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JEIDA/JCSD-Channel 1 Mono Syllables
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96S65-5
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Japan Electronic Industry Development Association's (JEIDA) Common Speech Data
Corpus (JCSD) was prepared by Jonathan Hamaker, Richard J. Duncan and Joe Picone of
the Institute for Signal and Information Processing at Mississippi State University.
*Data* This collection consists of high-fidelity recordings of 150 native speakers
of Japanese; each speaker produces four repetitions of 323 short prompts, including
city names, control words, monosyllabic words, isolated digits and strings of four
digits. Each reading session was recorded with two microphones, yielding two channels
that differ in audio quality for each utterance. Channel 0 (LDC96S64) contains data
recorded with a standard dynamic microphone (a Sanken MU-2C). Channel 1 (LDC96S65),
which is available separately, contains data recorded simultaneously with a condenser
microphone that presumably varied from site to site. A summary of the size and content
of the corpus is given below:
  number of speakers                150 speakers
    males                           75
    females                         75
  range of speaker age              10 yrs. to 70 yrs.
  number of items per speaker       323 items
    isolated digits                 15
    four digit sequences            35
    city names                      100
    monosyllables                   110
    control words (set A)           13
    control words (set B)           24
    control words (set C)           26
  number of repetitions per item    4 repetitions
  total number of utterances        193,763 utterances (per channel)
  sample frequency                  16 kHz
  sample type                       16-bit linear
  number of microphones             2 (dynamic and condenser)
For purposes of publication by the LDC, the corpus has been organized onto 40 CD-ROMs;
the data files are partitioned primarily by channel (20 CD-ROMs each for channel 0
and channel 1) and secondarily by category of prompts. These prompts include:
  Description                       Number of items
  Control Words:
    Banking Services                13
    Word Processors                 24
    Home Electronic Equipment       26
  Digits:
    Isolated Digits                 15
    Four Digit Sequences            35
  City Names                        100 (a phonetically-rich subset of common Japanese city names)
  Monosyllables                     110 (all Japanese monosyllables plus several used to pronounce foreign words)
JEIDA/JCSD-Channel 0 and JEIDA/JCSD-Channel 1 can each be ordered as complete sets.
Components of the corpus can also be purchased as outlined below:
  Price  Set-of  Description                            Catalog ID
  2000   20      JEIDA/JCSD-Channel 0 (Complete)        LDC96S64
  600    6       JEIDA/JCSD-Channel 0 City Names        LDC96S64-1
  400    4       JEIDA/JCSD-Channel 0 Control Words     LDC96S64-2
  100    1       JEIDA/JCSD-Channel 0 Isolated Digits   LDC96S64-3
  300    3       JEIDA/JCSD-Channel 0 Four Digit Seq.   LDC96S64-4
  600    6       JEIDA/JCSD-Channel 0 Monosyllables     LDC96S64-5
  2000   20      JEIDA/JCSD-Channel 1 (Complete)        LDC96S65
  600    6       JEIDA/JCSD-Channel 1 City Names        LDC96S65-1
  500    4       JEIDA/JCSD-Channel 1 Control Words     LDC96S65-2
  100    1       JEIDA/JCSD-Channel 1 Isolated Digits   LDC96S65-3
  300    3       JEIDA/JCSD-Channel 1 Four Digit Seq.   LDC96S65-4
  600    6       JEIDA/JCSD-Channel 1 Monosyllables     LDC96S65-5
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hamaker, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duncan, Richard J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picone, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Itahashi, Shuichi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Japan Electronic Industry Development Association (JEIDA)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96S65-5
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631051
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 171-247-873-339-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Message Understanding Conference (MUC) 6 Additional News Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Message Understanding Conference (MUC) 6 Additional News Text was produced by the
Linguistic Data Consortium (LDC) under catalog number LDC96T10 and ISBN 1-58563-105-1.
In the 1990s, the MUC evaluations funded the development of metrics and statistical
algorithms to support government evaluations of emerging information extraction
technologies. Additional information from NIST can be found at
http://www.itl.nist.gov/iaui/894.02/related_projects/muc.
*Data* This corpus contains additional training data that has been tagged but not
annotated. Both MUC 6 and MUC 6 Additional News Text are needed to replicate the
evaluation. All the materials are published as received from the corpus creators,
without any quality control at the LDC (the only difference is that the files have
been uncompressed).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chinchor, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631485
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 184-170-097-975-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
COMLEX Syntax Text Corpus Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Macleod, Catherine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grishman, Ralph
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630810
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 938-049-797-840-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Mandarin Chinese Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME Mandarin Chinese package includes transcripts and
documentation files. *Data* The transcripts cover contiguous five- or ten-minute
segments taken from 120 unscripted telephone conversations between native speakers
of Mandarin Chinese. The transcripts are timestamped by speaker turn for alignment
with the speech signal and are provided in standard orthography. In addition to transcript
files, this corpus contains full documentation on the transcription conventions and
format. Auditing and demographic information on the speakers represented in the transcripts
(including gender, channel quality and so on) is also included. The data is encoded
as "gb2312" (also known as "euc-cn"). The corpus of telephone speech (LDC96S34) is
available separately, as is an associated lexicon (LDC96L15).
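Because the summary states that the data is encoded as "gb2312" (also known as "euc-cn"), a reader would need to name that encoding when opening the files. The Python sketch below shows one way to do so; the helper names and the demo file are hypothetical, and nothing here reflects the corpus's actual file layout or naming.

```python
import codecs
import tempfile
from pathlib import Path


def read_transcript(path, encoding="gb2312"):
    """Return the text of a file stored in GB2312 (EUC-CN).

    Hypothetical helper; the path and default encoding are assumptions
    based only on the encoding named in the summary.
    """
    return Path(path).read_text(encoding=encoding)


def write_demo_file(text):
    """Write text to a temporary GB2312 file and return its path (demo only)."""
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as handle:
        handle.write(text.encode("gb2312"))
        return handle.name


# Python registers "euc-cn" as an alias of its "gb2312" codec.
SAME_CODEC = codecs.lookup("euc-cn").name == codecs.lookup("gb2312").name

if __name__ == "__main__":
    demo_path = write_demo_file("你好")
    print(read_transcript(demo_path))
```

Passing the encoding explicitly avoids relying on the platform default, which is often UTF-8 and would fail or produce garbled text on GB2312 bytes.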
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wheatley, Barbara
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630845
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 979-631-848-400-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Spanish Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME Spanish package includes transcripts and documentation
files. *Data* The transcripts cover a contiguous five or ten-minute segment taken
from 120 unscripted telephone conversations between native speakers of Spanish. The
transcripts are timestamped by speaker turn for alignment with the speech signal and
are provided in standard orthography. In addition to transcript files, this corpus
contains full documentation on the transcription conventions and format. Auditing
and demographic information on the speakers represented in the transcripts (including
gender, channel quality and so on) are also included. This corpus is distributed through
the LDC's FTP server. The corpus of telephone speech (LDC96S35) is available separately,
as is an associated lexicon (LDC96L16).
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wheatley, Barbara
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1996 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585630780
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC96T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 476-552-220-214-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Japanese Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1996]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC96T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME Japanese package includes transcripts and documentation
files. *Data* The transcripts cover a contiguous five- or ten-minute segment taken
from 120 unscripted telephone conversations between native speakers of Japanese. The
transcripts are timestamped by speaker turn for alignment with the speech signal and
are provided in standard orthography. In addition to transcript files, this corpus
contains full documentation on the transcription conventions and format. Auditing
and demographic information on the speakers represented in the transcripts (including
gender, channel quality and so on) are also included. This corpus is distributed through
the LDC's FTP server. The corpus of telephone speech (LDC96S37) is available separately,
as is an associated lexicon and transducer (LDC96L17).
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
- Geographic subdivision:
North America
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wheatley, Barbara
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaneko, Masayo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kobayashi, Megumi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC96T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631167
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97L18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 415-724-431-827-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME German Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97L18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME German corpus collection includes a lexical component. The CALLHOME German
lexicon consists of 318,807 words. Of these, 315,503 words are adapted from the CELEX
German lexicon produced by The Centre for Lexical Information, Max Planck Institute
for Psycholinguistics in Nijmegen and 3,304 additional words come from the 80 training
and 20 development test (devtest) transcripts (ten minutes each) from the LDC German
CALLHOME telephone speech corpus. *Data* The German lexicon contains tab-separated
information fields with orthographic, morphological, phonological, stress, source
and frequency information for each word. The
transcripts and documentation (LDC97T15) are available separately, as is a corpus
of telephone speech (LDC97S43).
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Karins, Krisjanis
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandmair, Monika
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lauscher, Susanne
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97L18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631108
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97L20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 119-159-358-214-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME American English Lexicon (PRONLEX)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97L20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME English collection includes a lexical component. The CALLHOME American
English Lexicon was originally distributed under the name COMLEX Pronouncing Lexicon,
or PRONLEX. Organizations that have already received PRONLEX will not be required
to purchase the CALLHOME American English Lexicon. *Data* The latest version of PRONLEX
contains 90,988 lexical entries and includes coverage of WSJ30K, WSJ64K, Switchboard
and CALLHOME English. (WSJ30K and WSJ64K are word lists selected from several years
of Wall Street Journal texts used in recent ARPA Continuous Speech Recognition corpora.
Switchboard is a three million word corpus of telephone conversations on a variety
of topics.) The PRONLEX documentation describes the principles observed for word transcription.
Although predictable variation in pronunciation due to dialect or variable reduction
has not been notated in the lexicon itself, the documentation notes systematic dialectal
variants, which may be generated by rule. In addition, alternate pronunciations are
given for words whose pronunciation varies by part of speech (e.g., abstrAct, Abstract),
or in less systematic but salient ways (especially names). Classes of exceptions to
the transcription principles, such as names, function words and foreign words, are
tagged. The transcripts and documentation (LDC97T14) are available,
as well as a corpus of telephone speech (LDC97S42).
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kingsbury, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97L20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631116
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S42
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 952-976-147-406-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME American English Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S42
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CALLHOME American English Speech was developed by the Linguistic Data Consortium (LDC)
and consists of 120 unscripted 30-minute telephone conversations between native speakers
of English. All calls originated in North America; 90 of the 120 calls were placed
to various locations outside of North America, while the remaining 30 calls were made
within North America. Most participants called family members or close friends. *Data*
This corpus contains speech data files with documentation describing their contents
and format along with the software packages needed to uncompress the speech data.
Corresponding transcripts and documentation (LDC97T14) are available separately, as
is an associated lexicon (LDC97L20).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S42
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631175
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S43
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 439-134-097-495-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME German Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S43
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME German corpus of telephone speech consists of 100 unscripted telephone
conversations between native speakers of German. *Data* All calls originated in North
America and were placed to locations overseas (typically Europe). Most participants
called family members or close friends. This corpus contains speech data files ONLY,
along with the minimal amount of documentation needed to describe the contents and
format of the speech files and the software packages needed to uncompress the speech
data. The transcripts and documentation (LDC97T15) are available separately, as is
an associated lexicon (LDC97L18).
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S43
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631094
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S44
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 876-519-945-577-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1996 English Broadcast News Speech (HUB4)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S44
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts
from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding
transcripts. The primary motivation for this collection is to provide training data
for the DARPA "HUB4" Project on continuous speech recognition in the broadcast domain.
*Data* The speech files are available as a training data set, development data and
evaluation data. The following programs are represented in this corpus:
* ABC Nightline
* ABC World Nightly News
* ABC World News Tonight
* CNN Early Edition
* CNN Early Prime News
* CNN Headline News
* CNN Prime Time News
* CNN The World Today
* CSPAN Washington Journal
* NPR All Things Considered
* NPR Marketplace
Transcripts have been made of all recordings in this publication, manually time aligned to the phrasal
level, annotated to identify boundaries between news stories, speaker turn boundaries
and gender information about the speakers. The released version of the transcripts
is in SGML format and there is accompanying documentation and an SGML DTD file, included
with the transcription release. The transcripts are available via FTP.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S44
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631140
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S45
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 102-150-894-143-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Egyptian Arabic Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S45
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The CALLHOME Egyptian Arabic corpus of telephone speech consists of 120 unscripted
telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA),
the spoken variety of Arabic found in Egypt. The dialect of ECA that this corpus
represents is Cairene Arabic. *Data* All calls, which lasted up to 30 minutes, originated
in North America and were placed to locations overseas (typically Egypt). Most participants
called family members or close friends. This corpus contains speech data files ONLY,
along with the minimal amount of documentation needed to describe the contents and
format of the speech files and the software packages needed to uncompress the speech
data. The transcripts and documentation (LDC97T19) are available separately, as is
an associated lexicon (LDC99L22).
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S45
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631213
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S62
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 988-076-156-109-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard-1 Release 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S62
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260
hours of speech and was originally collected by Texas Instruments in 1990-1, under
DARPA sponsorship. The first release of the corpus was published by NIST and distributed
by the LDC in 1992-3. Since that release, a number of corrections have been made to
the data files as presented on the original CD-ROM set and all copies of the first
pressing have been distributed. Switchboard is a collection of about 2,400 two-sided
telephone conversations among 543 speakers (302 male, 241 female) from all areas of
the United States. A computer-driven robot operator system handled the calls, giving
the caller appropriate recorded prompts, selecting and dialing another person (the
callee) to take part in a conversation, introducing a topic for discussion and recording
the speech from the two subjects into separate channels until the conversation was
finished. About 70 topics were provided, of which about 50 were used frequently. Selection
of topics and callees was constrained so that: (1) no two speakers would converse
together more than once and (2) no one spoke more than once on a given topic. *Data*
In this release, assembled and published by the LDC, all known errors affecting the
original publication of speech files were corrected. In addition, modifications have
been made to the contents of the NIST Sphere headers of all speech files, to identify
each file as being part of the new release and to make the usage of the sample_count
header field consistent with standard Sphere usage. (In particular, the sample_count
field should reflect the number of samples on each channel in the file. In the initial
release, this field was improperly set to be the total number of samples in both channels
of the file; this has been corrected in the new release.) Since the 1997 release, the
Switchboard transcripts have been carefully revised at The Institute for Signal and
Information Processing (ISIP) and additional problems have been discovered and patched.
Three speech files, part of the original release, were inadvertently left off the
1997 revision. After corpus users noted some problems in the original speaker attribution
table, LDC audited the problem calls and corrected the attributions. The latest version
of ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected
word alignments are all available at ISIP. The LDC makes the transcript summaries
available via HTTP. Researchers have used SWB-1 data for various annotation projects
including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date
orthographic transcriptions, and phonetic transcriptions. This summary documents which
files have been used for the various annotations. In addition to the index of these
file characteristics, there is also a table detailing speaker attributes.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Godfrey, John J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Holliman, Edward
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S62
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631205
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S63
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 566-795-587-797-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The CMU Kids Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S63
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database consists of sentences read aloud by children. It was originally
designed to create a training set of children's speech for the SPHINX II
automatic speech recognizer for use in the LISTEN project at Carnegie Mellon University.
*Data* The children range in age from six to eleven (see details below) and were in
first through third grades (the 11-year-old was in 6th grade) at the time of recording.
There were 24 male and 52 female speakers. Although the girls outnumber the boys,
we feel that the small difference in vocal tract length between the two at this age
should make the effect of this imbalance negligible. There are 5,180 utterances in
all. The speakers come from two separate populations. Since the LISTEN reading coach
needed good examples of reading aloud, it was decided that the majority of the speakers
should be "good" readers. They were recorded in the summer of 1995 and were enrolled
in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer
Fun program in Pittsburgh. They were recorded on-site. This set will hereafter be
called SUM95. There are 44 speakers and 3,333 utterances in this set. The LISTEN system
also needed examples of errorful reading and dialectal variants. The readers who supplied
this type of speech come from a school which has a high population of children who
are at risk of growing up poor readers and who could therefore benefit from any reading
tutor or other system built upon this database. They come from Fort Pitt School in
Pittsburgh and were recorded in April 1996. This subset will be referred to as FP.
There are 32 speakers and 1,847 utterances in this set. The list of speakers, the
set they are in and the number of sentences per speaker can be found in the "tables"
directory, in the file named "speaker.tbl." It should be noted that although there
will be some dialectal variation in the speech of the SUM95 subset, the speech of
the FP subset gives us a very good representation of dialects of the children that
may be targeted for the LISTEN system. However, the user should be aware that the
speakers' dialect partly reflects what is locally called "Pittsburghese."
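The speaker and utterance counts quoted in this summary can be cross-checked against each other. The following is a minimal sketch of that arithmetic, using only the figures stated above; the variable names are illustrative and are not part of the corpus or its "speaker.tbl" file.

```python
# Figures copied from the summary text; names here are illustrative only.
SUBSETS = {
    "SUM95": {"speakers": 44, "utterances": 3333},
    "FP": {"speakers": 32, "utterances": 1847},
}
SPEAKERS_BY_GENDER = {"male": 24, "female": 52}
TOTAL_UTTERANCES = 5180

# Sum the per-subset figures and compare them with the stated totals.
subset_speakers = sum(s["speakers"] for s in SUBSETS.values())
subset_utterances = sum(s["utterances"] for s in SUBSETS.values())
gender_speakers = sum(SPEAKERS_BY_GENDER.values())

print("speakers (by subset):", subset_speakers)
print("speakers (by gender):", gender_speakers)
print("utterances (by subset):", subset_utterances)
print("consistent:", subset_speakers == gender_speakers
      and subset_utterances == TOTAL_UTTERANCES)
```

Both speaker breakdowns sum to the same figure, and the subset utterance counts sum to the stated overall total.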
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Eskenazi, Maxine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mostow, Jack
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S63
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631086
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97S66
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 827-422-903-193-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1996 English Broadcast News Dev and Eval (HUB4)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97S66
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts
from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding
transcripts. The primary motivation for this collection is to provide training data
for the DARPA "HUB4" Project on continuous speech recognition in the broadcast domain.
*Data* The speech files are available in a 19 disc training data set with one additional
disc of development data and an additional disc of evaluation data. The following
programs are represented in this corpus:
* ABC Nightline
* ABC World Nightly News
* ABC World News Tonight
* CNN Early Edition
* CNN Early Prime News
* CNN Headline News
* CNN Prime Time News
* CNN The World Today
* CSPAN Washington Journal
* NPR All Things Considered
* NPR Marketplace
Transcripts have been made of all recordings in this publication, manually time aligned
to the phrasal level, annotated to identify boundaries between news stories, speaker turn
boundaries, and gender information about the speakers. The released version of the
transcripts is in SGML format and there is accompanying documentation and an SGML DTD
file, included with the transcription release. The transcripts are available via FTP.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97S66
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631191
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 690-427-158-676-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
DSO Corpus of Sense-Tagged English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains sense-tagged word occurrences for 121 nouns and 70 verbs which
are among the most frequently occurring and ambiguous words in English. These occurrences
are provided in about 192,800 sentences taken from the Brown corpus and the Wall Street
Journal and have been hand tagged by students at the Linguistics Program of the National
University of Singapore. WordNet 1.5 sense definitions of these nouns and verbs were
used to identify a word sense for each occurrence of each word. *Data* In addition
to providing the word occurrences in their full sentential context, the corpus includes
complete listings of the WordNet 1.5 sense definitions used in the tagging. The following
example illustrates the format of a sentence with a sense tag for the word "action,"
followed by the corresponding WordNet 1.5 sense definition:
ca01.db #020 `` These >> actions 8
proceeding, legal proceeding, judicial proceeding, proceedings -- (the institution of a legal action)
=> due process, due process of law -- (the administration of justice according to established rules and principles)
=> group action -- (action taken by a group of people)
=> act, human action, human activity -- (something that people do or cause to happen)
(In the actual corpus, all tagged occurrences of a given noun or verb are stored together
in one file, with each full sentence on one line; all noun and verb word sense definitions
are stored together in two separate files.) This
sense tagged corpus was provided by Hwee Tou Ng of the Defence Science Organisation
(DSO) of Singapore. It was first reported in the following paper at ACL-96: "Integrating
Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach,"
by Hwee Tou Ng and Hian Beng Lee, in Proceedings of the 34th Annual Meeting of the
Association for Computational Linguistics, pages 40-47, Santa Cruz, California, USA,
June 1996. ( http://xxx.lanl.gov/abs/cmp-lg/9606032 )
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ng, Hwee Tou
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Hian Beng
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631124
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 707-070-566-734-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME American English Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME English package includes time-aligned transcripts
and documentation files for 120 unscripted telephone conversations between native
speakers of English; a separate catalog entry (LDC97S42) provides the speech data
for these conversations, which are partitioned into separate subdirectories for "training"
(80 conversations), "development test set" (20 conversations) and "evaluation test
set" (20 conversations). *Data* The transcripts cover a contiguous ten minute segment
of each call in the training and development test sets, and a five minute segment
of each call in the evaluation set, yielding a total of 18.3 hours of transcribed
spontaneous speech, comprising about 230,000 words. The transcripts are timestamped
by speaker turn for alignment with the speech signal and are provided in standard
orthography. In addition to transcript files, this corpus contains full documentation
on the transcription conventions and format. Complete auditing information on the
speakers represented in the transcripts (including gender, channel quality and so
on) is also included. This corpus is distributed through the LDC's FTP server.
The corpus of telephone speech (LDC97S42) is available separately, as well as an associated
lexicon (LDC97L20).
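The total of 18.3 hours quoted in this summary follows from the conversation counts and segment lengths given for each partition. A minimal sketch of that arithmetic, assuming every call contributes exactly the stated segment length (the names below are illustrative, not taken from the corpus documentation):

```python
# (number of conversations, transcribed minutes per call), as stated in the summary.
PARTITIONS = {
    "training": (80, 10),
    "development test set": (20, 10),
    "evaluation test set": (20, 5),
}

def transcribed_hours(partitions):
    """Return total transcribed time in hours across all partitions."""
    total_minutes = sum(calls * minutes for calls, minutes in partitions.values())
    return total_minutes / 60

total_calls = sum(calls for calls, _ in PARTITIONS.values())
hours = transcribed_hours(PARTITIONS)

print("conversations:", total_calls)
print("transcribed hours:", round(hours, 1))
```

Under that assumption the three partitions account for all 120 conversations and 1,100 transcribed minutes, which rounds to the 18.3 hours reported.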
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kingsbury, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ADDED ENTRY--PERSONAL NAME
- Personal name:
McIntyre, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631183
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 067-875-636-088-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME German Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME German corpus package includes transcripts and
documentation files. The transcripts cover contiguous five or ten minute segments
taken from 100 unscripted telephone conversations between native speakers of German.
The transcripts are timestamped by speaker turn for alignment with the speech signal
and are provided in standard orthography. *Data* In addition to transcript files,
this corpus contains full documentation on the transcription conventions and format.
Complete auditing information on the speakers represented in the transcripts (including
gender, channel quality and so on) is also included. This corpus is distributed through
the LDC's FTP server. The corpus of telephone speech (LDC97S43) is available separately,
as well as an associated lexicon (LDC97L18). For a list of updates, user reports,
and other addenda, please go to LDC1997T15.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Karins, Krisjanis
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandmair, Monika
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lauscher, Susanne
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631159
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 356-881-507-091-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Egyptian Arabic Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The text component of the CALLHOME Egyptian Arabic package includes transcripts and
documentation files. The transcripts cover a contiguous five or ten minute segment
taken from 120 unscripted telephone conversations between native speakers of Egyptian
Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect
of ECA that this corpus represents is Cairene Arabic. *Data* The transcripts are
timestamped by speaker turn for alignment with the speech signal and are provided
in standard orthography. In addition to transcript files, this corpus contains full
documentation on the transcription conventions and format. Complete auditing information
on the speakers represented in the transcripts (including gender, channel quality
and so on) is also included. The corpus of telephone
speech (LDC97S45) is available separately, as is an associated lexicon (LDC99L22).
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gadalla, Hassan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kilany, Hanaa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Arram, Howaida
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yacoub, Ashraf
ADDED ENTRY--PERSONAL NAME
- Personal name:
El-Habashi, Alaa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shalaby, Amr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Karins, Krisjanis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rowson, Everett
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kingsbury, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, Cynthia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631493
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC97T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 444-268-955-648-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1996 English Broadcast News Transcripts (HUB4)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC97T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts
from ABC, CNN, and CSPAN television networks and NPR and PRI radio networks with corresponding
transcripts. The primary motivation for this collection is to provide training data
for the DARPA "HUB4" Project on continuous speech recognition in the broadcast domain.
The speech files are available in a 19 disc training data set with one additional
disc of development data and an additional disc of evaluation data. The following
programs are represented in this corpus:
* ABC Nightline
* ABC World Nightly News
* ABC World News Tonight
* CNN Early Edition
* CNN Early Prime News
* CNN Headline News
* CNN Prime Time News
* CNN The World Today
* CSPAN Washington Journal
* NPR All Things Considered
* NPR Marketplace
*Data* Transcripts have been made of all recordings in this publication, manually time
aligned to the phrasal level, annotated to identify boundaries between news stories,
speaker turn boundaries and gender information about the speakers. The released version
of the transcripts is in SGML format and there is accompanying documentation and an SGML
DTD file, included with the transcription release. The transcripts are available via FTP.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC97T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1994 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631477
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98L21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 639-666-285-650-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
COMLEX English Syntax Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1994]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98L21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This is a moderately broad coverage English lexicon (with about 38,000 lemmas) developed
at New York University under LDC sponsorship. It contains detailed information about
the syntactic characteristics of each lexical item and is particularly detailed in
its treatment of subcategorization (complement structures). *Data* In the current
dictionary, nouns have nine possible features and nine possible complements; adjectives
have seven features and 14 complements; verbs have five features and 92 complements.
The entries for 750 frequent verbs contain 100 tags each, where a tag includes: a
pointer to an instance of that verb in a corpus and the subcategorization appropriate
for that instance. Some references for the syntax and semantics work: Ralph Grishman,
Catherine Macleod and Adam Meyers. Comlex syntax: Building a computational lexicon.
Proc. 15th Int'l Conf. Computational Linguistics (COLING 94), Kyoto, Japan, August
1994. Macleod, Catherine, Adam Meyers and Ralph Grishman. The Influence of Tagging
on the Classification of Lexical Complements. Proc. 16th Int'l Conf. Computational
Linguistics (COLING 96), Copenhagen, Denmark, August 1996.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Macleod, Catherine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grishman, Ralph
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98L21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631302
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S67
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 866-042-083-505-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HTIMIT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S67
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The HTIMIT corpus is a re-recording of a subset of the TIMIT corpus through different
telephone handsets. The aim was to create a corpus for the study of telephone transducer
effects on speech which minimized confounding factors, such as variable telephone
channels and background noise. HTIMIT was created by playing ten TIMIT sentences from
192 male and 192 female speakers through a stereo loudspeaker into different transducers positioned
directly in front of the loudspeaker and digitizing the output from the transducers.
Ten (10) transducers (telephone handsets) were used. Most of these were not new; handsets
with obvious damage were not used, but in order to obtain some diversity with a limited
number of handsets, handsets were selected to have variable sound characteristics,
transducer designs or, in the case of electrets, different grill designs. Further
information about the handsets is provided in the corpus documentation. *Data* The
collection procedure was not ideal with respect to realism of sound transduction,
but it does allow for the collection of speech from a large number of speakers repeating
identical speech on each instance. Furthermore, coupled with the phonetic markings
from the original TIMIT corpus, HTIMIT offers the ability to study handset transducer
effects on speech recognition systems. To address the realism of the sound transduction
in HTIMIT, a second corpus using the same handsets but with live people speaking into
the handsets is also available. This corpus is called the Lincoln Laboratory Handset
Database (LLHDB), LDC98S68.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reynolds, Douglas
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S67
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631361
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S68
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 842-109-246-373-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S68
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LLHDB consists of recordings of people speaking into ten different telephone handsets.
The aim was to create a corpus for the study of telephone transducer effects on speech
which minimized confounding factors, such as variable telephone channels and background
noise. LLHDB was created by having volunteers speak prompted and extemporaneous speech
into different transducers in a sound-proof room and directly digitizing the output
from the transducers on a SunSparc A/D at an 8 kHz sampling rate and 16-bit resolution.
*Data* There were three types of speech recorded for each handset. First, the speaker
read the "rainbow passage" [Nolan 83], a 97-word passage sometimes used in phonetic
research. Second, the speaker read ten sentences extracted from TIMIT (LDC93S1). Finally,
the speaker was asked to describe a photograph for approximately 40 seconds (a different
photograph was used for each handset). LLHDB contains speech from 53 speakers (24
males and 29 females) recruited from the laboratory. Because the same handsets are
used in both HTIMIT (LDC98S67) and LLHDB, it is possible to compare the effects of
the two different recording methods.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reynolds, Douglas
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S68
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631310
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S69
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 333-068-970-015-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HUB5 Mandarin Telephone Speech Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S69
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of HUB5 Mandarin training data consists of 42 calls derived from the
CALLFRIEND Mandarin Chinese Mainland Dialect (Language ID) collection. The transcribed
data is intended as additional training data in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of
Defense. The transcripts cover a contiguous 5-30 minute segment taken from a recorded
conversation lasting up to 30 minutes. *Data* Speakers were solicited by the LDC to
participate in this telephone speech collection effort via the internet, publications
(advertisements) and personal contacts. A total of 200 call originators were found,
each of whom placed a telephone call via a toll-free robot operator maintained by
the LDC. Access to the robot operator was possible via a unique Personal Identification
Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in
the project. The participants were made aware that their telephone call would be recorded,
as were the call recipients. The call was allowed only if both parties agreed to being
recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion
of the call, the caller was paid $20 (in addition to making a free long-distance telephone
call). Each caller was allowed to place only one telephone call. They were given no
guidelines concerning what they should talk about. Once a caller was recruited to
participate, he/she was given a free choice of whom to call. Most participants called
family members or close friends. All calls originated in North America and were placed
to various locations within North America.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S69
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631337
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S70
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 755-936-160-383-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HUB5 Spanish Telephone Speech Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S70
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of HUB5 Spanish training data consists of 106 calls derived from the
CALLFRIEND Spanish (Language ID) collection. The transcripts cover a contiguous 10-30
minute segment taken from a recorded conversation lasting up to 30 minutes. These
calls were originally collected by the LDC in support of the project on Language Recognition,
sponsored by the U.S. Department of Defense. All of these calls are being designated
as additional training data for the project on Large Vocabulary Conversational Speech
Recognition (LVCSR) in Spanish. *Data* Speakers were solicited by the LDC to participate
in this telephone speech collection effort via the internet, publications (advertisements)
and personal contacts. A total of 200 call originators were found, each of whom placed
a telephone call via a toll-free robot operator maintained by the LDC. Access to the
robot operator was possible via a unique Personal Identification Number (PIN) issued
by the recruiting staff at the LDC when the caller enrolled in the project. Once a
caller was recruited to participate, he/she was given a free choice of whom to call.
Recruits were given no guidelines concerning what they should talk about. Most participants
called family members or close friends. All calls originated in North America and
were placed to various locations within North America, Puerto Rico or the Dominican
Republic. The participants were made aware that their telephone call would be recorded,
as were the call recipients. The call was allowed only if both parties agreed to being
recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion
of the call, the caller was paid $20 (in addition to making a free long-distance telephone
call). Each caller was allowed to place only one telephone call. HUB5 Spanish speech
and transcript data may be obtained by contacting the LDC.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S70
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S71
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 331-835-398-589-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 English Broadcast News Speech (HUB4)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S71
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains a total of 97 hours of recordings from radio and television
news broadcasts, gathered between June 1997 and February 1998. It has been prepared
to serve as a supplement to the 1996 Broadcast News Speech collection (consisting
of over 100 hours of similar recordings). The primary motivation for this collection
is to provide additional training data for the DARPA "HUB4" Project on continuous
speech recognition in the broadcast domain. *Data* Transcripts have been made of all
recordings in this publication. They are manually time-aligned to the phrasal level
and annotated to mark boundaries between news stories, speaker turn boundaries and
speaker gender. The transcription conventions are described in the file "transcrp.doc";
note that this file describes the transcription methods by reference to text formatting
conventions used internally by the LDC during the transcription process. The released
version of the transcripts is in SGML format, comparable to the format used in the
1996 Broadcast News Speech transcriptions. Accompanying documentation and an SGML
DTD file are included with the transcription release.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S71
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631396
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S72
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 388-547-288-616-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Taiwanese Putonghua Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S72
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This set of data on Taiwanese accented Putonghua (PTH) was gathered by San Duanmu
at the University of Michigan. The data was recorded in Taiwan from December 1994
to January 1995. Taiwanese accented PTH refers to PTH spoken by people who were born
in Taiwan and whose first language is Taiwanese (Southern Min). *Data* A total of
40 speakers, ranging in age, education, birthplace and family dialect, were recorded.
There were five two-speaker dialogues and 30 single-speaker monologues. The dialogues
were about 20 minutes each and the monologues were about 10 minutes each. Dialogues
were recorded on two tracks, one for each speaker. Monologues were recorded on one
track. The recordings were done in ordinary but quiet rooms. The speakers were asked
in advance to speak in a conversational style, without notes, on any topic they chose,
or no topic at all. Most speakers spoke spontaneously and the topic drifted freely.
Some speakers talked about their professional work in a rather formal way. One speaker
(#20, a public health official) used notes. Overall, the corpus provides an informative
sampling of variation in speech style. The recording equipment consisted of a portable
DAT recorder (Teac) recording at a 44.1 kHz sampling rate with 16-bit linear quantization.
The microphones were AudioTechnica lapel microphones with a preamp and XLR connection
to the DAT. The XLR connection helped keep recording noise low; the AudioTechnica microphones
provided wide bandwidth and a flat response over the speech range of interest, were
unidirectional to minimize cross-talk, and were very light in comparison with standard
microphones. Both single-speaker
monologues and two-speaker dialogues were recorded using this system on standard DAT
tape. For publication on CD-ROM, the original DAT recordings were downsampled to a
16 kHz sample rate. Before recording, all speakers read and signed the "Informed Consent
Form," which was written in Chinese and which largely followed the standard format
approved by the Human Subject Committee of the University of Michigan. The form stated
that participation in the recording was entirely voluntary and that the speech
may be used for linguistic teaching and research purposes. The speech data are accompanied
by transcripts. The monologues have start and end time stamps. The five dialogues
are time stamped by speaker turn.
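
The figures in this summary can be cross-checked with a little arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the durations are the approximate values quoted above ("about 20 minutes" and "about 10 minutes"), and the resampling ratio is simply the reduced fraction 16000/44100, not a statement about the tool actually used to downsample the DAT recordings.

```python
from fractions import Fraction

# Rates quoted in the summary.
DAT_RATE_HZ = 44_100   # original DAT recordings
CD_RATE_HZ = 16_000    # rate used for the CD-ROM publication

# A rational resampler would interpolate by `up` and decimate by `down`.
ratio = Fraction(CD_RATE_HZ, DAT_RATE_HZ)
up, down = ratio.numerator, ratio.denominator

# Session counts quoted in the summary.
DIALOGUES = 5     # two speakers each, about 20 minutes each
MONOLOGUES = 30   # one speaker each, about 10 minutes each

speakers = DIALOGUES * 2 + MONOLOGUES
approx_minutes = DIALOGUES * 20 + MONOLOGUES * 10  # nominal, not measured

print(f"resampling ratio: {up}/{down}")
print(f"speakers: {speakers}")
print(f"approximate recorded time: {approx_minutes} min "
      f"({approx_minutes / 60:.1f} h)")
```

Under these assumptions the speaker count agrees with the stated total of 40, and the nominal recorded time comes to roughly 400 minutes.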
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Duanmu, San
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wakefield, Gregory H.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hsu, Yi-ping
ADDED ENTRY--PERSONAL NAME
- Personal name:
Qui, Shan-ping
ADDED ENTRY--PERSONAL NAME
- Personal name:
Guevara, Rowena Cristina
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S72
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631256
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S73
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 271-025-922-959-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 Mandarin Broadcast News Speech (HUB4-NE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S73
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This collection consists of 30 hours of broadcast news recordings from the following
sources: Voice of America (VOA), China Central TV (CCTV) and KAZN-AM, a commercial
radio station based in Los Angeles, CA. Of these three sources, the first two comprise
the bulk of the collection and are represented in roughly equal amounts. Only a relatively
small sample of KAZN-AM recordings is included, owing to the relatively high proportion
of unusable material in that source (e.g., commercials, local traffic reports). Corresponding
transcripts are released as 1997 Mandarin Broadcast News Transcripts (HUB4-NE) LDC98T24.
*Data* All recordings were made using a single channel and a 16 kHz sample frequency.
Most files contain 30 minutes of recordings. There are some larger files consisting
of 60 minutes and 120 minutes of programming.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wu, Xuling
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yan, Yongmin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Qin, Zhoakai
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S73
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631272
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S74
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 684-931-706-325-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 Spanish Broadcast News Speech (HUB4-NE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S74
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains a portion of the acoustic data designated as the training set
for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30
hours of broadcast news from the following sources: Televisa, Univision and VOA. *Data*
All acoustic files are in NIST SPHERE format, without compression. The sample data
are 16-bit linear PCM, 16 kHz sample frequency, single channel. Most files contain
30 minutes of recorded material and some contain 60 or 120 minutes (approximately);
the sampling format requires roughly two megabytes (MB) per minute of recording, so
the file sizes are typically around 60 MB, with some files ranging up to 120 or 240
MB. The transcripts are in SGML format, using the same markup conventions that have
been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin),
and are distributed by FTP rather than on the CD-ROMs with the speech data.
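
The file-size estimate in this summary follows directly from the sample format. As a minimal sketch (the function name is hypothetical, sizes are in units of 10^6 bytes, and the SPHERE header is ignored as negligible), the calculation can be written as:

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sample frequency, from the summary
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # single channel


def pcm_megabytes(minutes: float) -> float:
    """Approximate uncompressed audio size in MB (10**6 bytes), header excluded."""
    total_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * minutes * 60
    return total_bytes / 1_000_000


# Durations mentioned in the summary: 1 minute, then the three file lengths.
for minutes in (1, 30, 60, 120):
    print(f"{minutes:>3} min: about {pcm_megabytes(minutes):.2f} MB")
```

One minute works out to 1.92 MB, which matches the "roughly two megabytes per minute" figure; the 30-, 60- and 120-minute files come out slightly below the rounded 60, 120 and 240 MB values quoted above.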
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S74
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631388
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S75
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 818-666-043-021-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard-2 Phase I
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S75
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Switchboard-2 Phase I consists of 3,638 five-minute telephone conversations involving
657 participants. This corpus was collected by the Linguistic Data Consortium (LDC),
in support of a project on Speaker Recognition sponsored by the U.S. Department of
Defense. This release consists of speech files only; these calls were not transcribed.
*Data* Speakers were solicited by the LDC to participate in this telephone speech
collection effort via the internet, publications (advertisements) and personal contacts.
Potential participants responded from all areas of the United States, although the
majority of the subjects were from the Mid-Atlantic area: (PA=303), (NJ=116), (NY=53),
(DE=13), (CT=12), (MD=14), (OH=13) and (MA=8). Most of the participants in SWB-2 Phase
I were college students from the following universities: Penn State University, University
of Delaware, University of Pennsylvania, Drexel University and Rutgers University.
Of the 657 participants, 358 were female and 299 were male. An LDC recruiter asked
all participants for the following demographic information: age, sex, years of completed
education, country of birth, city and state where raised. Each recruit was asked to
participate in at least ten five-minute phone calls. Ideally each participant would
receive five calls at a designated number and make five calls from phones with different
telephone numbers (ANI codes). The average subject participated in 11 conversations;
however, one gentleman participated in 64 calls. A suggested topic of discussion was
given (read by the automated operator), although participants could chat about whatever
they preferred. Each of the 657 participants placed their calls via a toll-free robot
operator maintained by the LDC. Access to the robot operator was possible via a unique
Personal Identification Number (PIN) issued by the recruiting staff at the LDC when
the caller enrolled in the project. Upon conclusion of the study, all calls were audited
by LDC staff members. Particular attention was paid to PIN verification (matching
speaker with PIN), call duration and call quality. Upon completion of this
process, checks were issued and mailed to participants.
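
The participation figures above are mutually consistent, which a short tally can confirm. This is a sketch under stated assumptions: every conversation is taken to have exactly two sides, the parenthesized state figures are read as participant counts, and all variable names are hypothetical.

```python
CONVERSATIONS = 3_638
PARTICIPANTS = 657
FEMALE, MALE = 358, 299
SIDES_PER_CALL = 2        # assumption: two participants per conversation
NOMINAL_CALL_MINUTES = 5  # nominal call length quoted in the summary

# State figures as listed in the summary (assumed to be participant counts).
STATE_COUNTS = {
    "PA": 303, "NJ": 116, "NY": 53, "DE": 13,
    "CT": 12, "MD": 14, "OH": 13, "MA": 8,
}

gender_total_matches = (FEMALE + MALE) == PARTICIPANTS
avg_calls = CONVERSATIONS * SIDES_PER_CALL / PARTICIPANTS
listed_total = sum(STATE_COUNTS.values())
unlisted = PARTICIPANTS - listed_total
nominal_hours = CONVERSATIONS * NOMINAL_CALL_MINUTES / 60

print(f"female + male equals total: {gender_total_matches}")
print(f"average conversations per participant: {avg_calls:.2f}")
print(f"participants in listed states: {listed_total} (others: {unlisted})")
print(f"nominal conversation time: {nominal_hours:.0f} hours")
```

With two sides per call, the average rounds to the 11 conversations per subject reported in the summary; the eight listed states account for 532 of the 657 participants.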
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S75
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631299
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S76
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 773-295-516-240-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1998 Speaker Recognition Benchmark
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S76
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1998 speaker recognition evaluation is part of an ongoing series of yearly benchmark
tests conducted by NIST. These tests are intended to provide a stable reference point
for measuring and comparing the performance of diverse methods for text-independent
speaker recognition over the telephone and should be of interest to all researchers
working in this area of speech technology development. The test sets and evaluation
protocols have been designed to be simple, to focus on core technology issues, to
be fully supported and to be accessible. *Data* In 1996 and 1997 handset variation
was featured as a prominent technical challenge to be addressed. While handset variation
remains a formidable challenge, the 1998 evaluation directs greatest attention toward
speaker recognition performance for the case in which both training and test data
are from the same source. The speech data were recorded by the LDC between January
and March 1997; most of the speakers recruited for this collection were college students
from the Great Lakes (Northern Midwest) region of the U.S.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S76
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631418
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98S77
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 074-386-777-466-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Voicemail Corpus Part I
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98S77
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was created by M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P. S. Gopalakrishnan
and C. Dunn. *Data* This corpus consists of 1,801 messages, collected from volunteers
at various IBM sites in the United States, comprising the training data set, and 42
messages in the development test set. The average voicemail message is 31 seconds
in duration and has about 100 words. Approximately 38% of the messages correspond
to male speakers; the remainder correspond to female speakers. All messages were transcribed
by IBM.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Padmanabhan, M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramaswamy, G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramabhadran, B.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gopalakrishnan, P.S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dunn, C.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98S77
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631264
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 915-625-485-899-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 Mandarin Broadcast News Transcripts (HUB4-NE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This collection consists of 30 hours of transcripts of Mandarin Chinese broadcast
news recordings from the following sources: Voice of America (VOA), China Central
TV (CCTV) and KAZN-AM, a commercial radio station based in Los Angeles, CA. Of these
three sources, the first two comprise the bulk of the collection and are represented
in roughly equal amounts. Only a relatively small sample of KAZN-AM recordings is
included, owing to the relatively high proportion of unusable material in that source (e.g.,
commercials, local traffic reports). Corresponding audio files are released as 1997
Mandarin Broadcast News Speech (HUB4-NE) LDC98S73. *Data* The transcripts were created
by native speakers of Mandarin working at LDC. They are in GB-encoded form with SGML
tags to identify story boundaries, speaker turn boundaries and phrasal pauses. The
tags include time stamps to align the text with the speech data. Word segmentation
(white-space between words) is included. A working DTD is provided, and the markup
is consistent with that of the 1997 English and Spanish HUB4 collections.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wu, Xuling
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yan, Yongmin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Qin, Zhoakai
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 770-765-444-577-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT Pilot Study Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The TDT Pilot Study corpus was created to support an initiative in "topic detection
and tracking." This initiative is directed toward computer processing of language
data, both text and speech. The objective is to explore techniques for detecting
the appearance of new and unexpected topics and for tracking their reappearance and
evolution. *Data* The TDT corpus comprises a set of stories that includes
both newswire (text) and broadcast news (speech). Each story is represented as a stream
of text, in which the text is either taken directly from the newswire (Reuters) or
is a manual transcription of the broadcast news speech (CNN). The corpus spans the
period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories,
with about half taken from Reuters newswire and half from CNN broadcast news transcripts.
A key part of the corpus is its annotation in terms of
the events discussed in the stories. Twenty-five events were defined that span a variety of
event types and that cover a subset of the events discussed in the corpus stories.
Annotation data for these events are included in the corpus and provide a basis for
training TDT systems.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Allan, James
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yang, Yiming
ADDED ENTRY--PERSONAL NAME
- Personal name:
Carbonell, Jaime
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yamron, Jon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wayne, Charles
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631329
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 943-578-103-129-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HUB5 Mandarin Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of HUB5 Mandarin training data consists of 42 calls derived from the
CALLFRIEND Mandarin Chinese Mainland Dialect (Language ID) collection. The transcribed
data is intended as additional training data in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of
Defense. The transcripts cover a contiguous 5-30 minute segment taken from a recorded
conversation lasting up to 30 minutes. *Data* Speakers were solicited by the LDC to
participate in this telephone speech collection effort via the internet, publications
(advertisements) and personal contacts. A total of 200 call originators were found,
each of whom placed a telephone call via a toll-free robot operator maintained by
the LDC. Access to the robot operator was possible via a unique Personal Identification
Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in
the project. The participants were made aware that their telephone call would be recorded,
as were the call recipients. The call was allowed only if both parties agreed to being
recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion
of the call, the caller was paid $20 (in addition to making a free long-distance telephone
call). Each caller was allowed to place only one telephone call. They were given no
guidelines concerning what they should talk about. Once a caller was recruited to
participate, he/she was given a free choice of whom to call. Most participants called
family members or close friends. All calls originated in North America and were placed
to various locations within North America.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631345
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 997-940-878-462-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HUB5 Spanish Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of HUB5 Spanish training data consists of 106 calls derived from the
CALLFRIEND Spanish (Language ID) collection. The transcripts cover a contiguous 10-30
minute segment taken from a recorded conversation lasting up to 30 minutes. These
calls were originally collected by the LDC in support of the project on Language Recognition,
sponsored by the U.S. Department of Defense. All of these calls are being designated
as additional training data for the project on Large Vocabulary Conversational Speech
Recognition (LVCSR) in Spanish. *Data* Speakers were solicited by the LDC to participate
in this telephone speech collection effort via the internet, publications (advertisements)
and personal contacts. A total of 200 call originators were found, each of whom placed
a telephone call via a toll-free robot operator maintained by the LDC. Access to the
robot operator was possible via a unique Personal Identification Number (PIN) issued
by the recruiting staff at the LDC when the caller enrolled in the project. Once a
caller was recruited to participate, he/she was given a free choice of whom to call.
Recruits were given no guidelines concerning what they should talk about. Most participants
called family members or close friends. All calls originated in North America and
were placed to various locations within North America, Puerto Rico or the Dominican
Republic. The participants were made aware that their telephone call would be recorded,
as were the call recipients. The call was allowed only if both parties agreed to being
recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion
of the call, the caller was paid $20 (in addition to making a free long-distance telephone
call). Each caller was allowed to place only one telephone call. HUB5 Spanish speech
and transcript data may be obtained by contacting the LDC.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Munoz, Elisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631248
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 789-160-485-831-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 English Broadcast News Transcripts (HUB4)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication has been prepared to serve as a supplement to the 1996 Broadcast
News Speech collection (consisting of over 100 hours of similar recordings). The primary
motivation for this collection is to provide additional training data for the DARPA
"HUB4" Project on continuous speech recognition in the broadcast domain. *Data* This
set of 18 CD-ROMs contains a total of 97 hours of recordings from radio and television
news broadcasts, gathered between June 1997 and February 1998. Transcripts have been
made of all recordings in this publication, manually time aligned to the phrasal level,
annotated to identify boundaries between news stories, speaker turn boundaries and
gender information about the speakers. The transcription conventions are described
in the file "transcrp.doc" -- please note that this file describes the transcription
methods by reference to text formatting conventions used internally by the LDC during
the transcription process. The released version of the transcripts is in SGML format,
comparable to the format that was used in the 1996 Broadcast News Speech transcriptions;
accompanying documentation and an SGML DTD file are included with the transcription
release.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631280
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T29
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 873-191-836-513-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 Spanish Broadcast News Transcripts (HUB4-NE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T29
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains a portion of the acoustic data designated as the training set
for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30
hours of broadcast news from the following sources: Televisa, Univision and VOA. Corresponding
speech data is released as 1997 Spanish Broadcast News Speech (HUB4-NE) (LDC98S74).
*Data* All acoustic files are in NIST SPHERE format, without compression. The sample
data are 16-bit linear PCM, 16-kHz sample frequency, single channel. Most files contain
30 minutes of recorded material, and some contain 60 or 120 minutes (approximately);
the sampling format requires roughly two megabytes (MB) per minute of recording, so
the file sizes are typically around 60 MB, with some files ranging up to 120 or 240
MB. The transcripts are in SGML format, using the same markup conventions that have
been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin).
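The "roughly two megabytes per minute" figure follows directly from the sampling format quoted above. A minimal sketch of the arithmetic, assuming decimal megabytes (1 MB = 1,000,000 bytes) and ignoring the SPHERE file header; the function name `pcm_megabytes` is illustrative and not part of the corpus documentation:

```python
# Storage needed for uncompressed PCM audio, using the figures quoted in the summary.
SAMPLE_RATE_HZ = 16_000   # 16-kHz sample frequency
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # single channel


def pcm_megabytes(minutes, bytes_per_mb=1_000_000):
    """Approximate audio payload size in MB (assumed decimal) for a recording."""
    total_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * 60 * minutes
    return total_bytes / bytes_per_mb


if __name__ == "__main__":
    # Recording lengths mentioned in the summary: 30, 60 and 120 minutes.
    for minutes in (1, 30, 60, 120):
        print(f"{minutes:>3} min: {pcm_megabytes(minutes):.1f} MB")
```

One minute works out to 1.92 MB, so 30-, 60- and 120-minute files come to about 58, 115 and 230 MB, which matches the rounded 60, 120 and 240 MB quoted in the summary.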
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Munoz, Elisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alabiso, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T29
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 686-158-826-526-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
North American News Text Supplement
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of North American News Text provides a supplement to the LDC's earlier
publication of similar materials (LDC95T21: North American News Text Corpus). The
same TIPSTER-style SGML markup is used in formatting the data. The data sources are
as follows:

Source                                  Dates Covered   Approx. # Words (Millions)
--------------------------------------------------------------------------------
Los Angeles Times & Washington Post     09/97-04/98     11
New York Times News Syndicate           01/97-04/98     116
Associated Press World Stream English   11/94-04/98     143
--------------------------------------------------------------------------------
The previous North American News release included prior materials from both the LA
Times/Washington Post and the New York Times; this supplement provides the continuation
of those sources. *Data* The LDC has been collecting the Associated Press Worldstream
newswire service in six languages since 1994. This is the first release of the English
language portion of this service. The material in this set is typically NOT North
American in origin -- the reporters who provide the stories may or may not be American
born, but the locations and topics covered are much more heavily international in
comparison to the North American wire services. Reports from Asia, Africa and Europe
are found here that show up only rarely or not at all in North American newspapers,
including political, financial and sports stories that are presumably geared to English-speaking
readers in those parts of the world. This release, when combined with the LDC's earlier
NA News Text Corpus, constitutes all the English-language newswire text collected
by the LDC between January 1994 and April 1998, inclusive.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631221
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T31
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 905-430-625-113-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1996 CSR HUB4 Language Model
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T31
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus contains data from transcribed news broadcasts, designated for use in
the baseline language model (LM) for the 1996 CSR HUB4 Evaluation. *Data* The LDC
obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source
Media, Inc. This portion includes the period from January 1992 to April 1996 and contains
approximately one gigabyte of data uncompressed. This release also includes about
36 megabytes of material received on floppy disks covering the period from late May
through June 1996, with somewhat different format from the bulk of the data. The text
data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged")
form and (2) a fully processed ("conditioned," "verbalized-punctuation") form. The
"raw" form includes the header and footer information accompanying the articles, such
as network, show name, headline, copyright, credits and so forth; the text and ancillary
data are presented in a fairly consistent (though simple) SGML format. The "processed"
form contains only the text content of the articles, together with SGML tags to mark
the boundaries of articles, paragraphs and sentences; the text content has been modified
by replacing numeric strings (dates, dollar amounts, quantities) with orthographic
strings (e.g. "nineteen ninety six"), replacing abbreviations ("Inc.," "Ltd.," "Corp.,"
etc.) with corresponding full-word forms and replacing punctuation characters with
corresponding word tokens (e.g. "," becomes "COMMA"). This release also includes an
archive of the tools used to create the "processed" form of the data.
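The abbreviation and punctuation steps described above can be sketched roughly as follows. This is a minimal illustration and not the LDC tools shipped with the release: the function name `condition` and both mapping tables are assumptions, and only the "," to "COMMA" mapping is taken from the description. Verbalization of numeric strings is omitted.

```python
# Hypothetical mapping tables; only "," -> "COMMA" comes from the description above.
PUNCT_TOKENS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION-MARK"}
ABBREVIATIONS = {"Inc.": "Incorporated", "Ltd.": "Limited", "Corp.": "Corporation"}


def condition(sentence):
    """Expand known abbreviations and turn trailing punctuation into word tokens."""
    tokens = []
    for word in sentence.split():
        trailing = []
        # Peel punctuation off the end until an abbreviation or a bare word remains.
        while word and word not in ABBREVIATIONS and word[-1] in PUNCT_TOKENS:
            trailing.append(PUNCT_TOKENS[word[-1]])
            word = word[:-1]
        if word:
            tokens.append(ABBREVIATIONS.get(word, word))
        # Punctuation was collected right to left, so restore the original order.
        tokens.extend(reversed(trailing))
    return " ".join(tokens)


print(condition("Acme Corp., based in Ohio, grew."))
```

Checking for an abbreviation before stripping punctuation keeps the period in "Corp." from being verbalized separately.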
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
MacIntyre, Robert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T31
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1998 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631353
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC98T32
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 380-519-899-609-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1998]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC98T32
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication represents a release of the JURIS (Justice Department Retrieval and
Inquiry System) data collection that has been made available to the Linguistic Data
Consortium (LDC) by the U.S. Department of Justice. The time span of the text ranges
from the 1700s to the early 1990s. *Data* There are 1,664 individual text files in
the corpus, 1,011 on the first CD-ROM and 653 on the second. The original archive consisted
of 219 files ranging between less than 1 MB and nearly 70 MB in size. In order to
make the data more accessible for research use, we chose to divide the larger files
into pieces, such that the average file size was about 2 MB when uncompressed (the
largest uncompressed file size is about 4.5 MB). Divisions of the files were done
at document boundaries, so all files contain whole documents. There are a total of
694,667 document units in the corpus and these can be categorized to some extent with
regard to their content. The following is a partial list of categories and their descriptions
drawn from JURIS documentation contained in the corpus. The terminology and organization
of categories are those used in the JURIS documentation: * Case Law * Executive Order
* Regulations * Federal Register * Statutory Law * Administrative Law * International
Agreements * Freedom of Information Act and related documents * Indian Law * Tax Law
* Brief As many of the documents contain Social Security Numbers of the parties involved,
these have been redacted to protect the privacy of those individuals. All valid Social
Security Numbers have been replaced with the string XXX-XX-XXXX. In some documents,
number strings may be identified as Social Security Numbers, but they are in fact
substitutions such as the series, 123-45-6789 or 987-65-4321. These ersatz numbers
have been left unchanged. Some personal names have also been redacted and replaced
with XXXXXXXX.
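The redaction rule described above can be sketched as follows. This is an illustration only, not the procedure LDC actually used: the sketch assumes that any string of the form ddd-dd-dddd counts as a Social Security Number unless it is one of the two placeholder series named in the description, and the names `redact_ssns`, `SSN_PATTERN` and `ERSATZ` are hypothetical.

```python
import re

# Assumption: anything shaped like ddd-dd-dddd is treated as a candidate SSN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Placeholder series named in the description; these are left unchanged.
ERSATZ = {"123-45-6789", "987-65-4321"}


def redact_ssns(text):
    """Replace candidate SSNs with XXX-XX-XXXX, keeping known placeholder values."""
    def _substitute(match):
        value = match.group(0)
        return value if value in ERSATZ else "XXX-XX-XXXX"

    return SSN_PATTERN.sub(_substitute, text)


print(redact_ssns("Claimant 078-05-1120; sample form shows 123-45-6789."))
```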
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgovsky, Paul
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC98T32
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1997 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631558
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99L22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 155-303-991-688-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Egyptian Colloquial Arabic Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1997]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99L22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This lexicon represents the first electronic pronunciation dictionary of Egyptian
Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect
of ECA that this dictionary represents is Cairene Arabic. *Data* The lexicon contains
51,202 entries, drawn from 140 CALLHOME telephone conversations among native speakers
of Colloquial Egyptian Arabic, collected and published by the LDC as follows: CALLHOME
Egyptian Arabic Speech LDC97S45, CALLHOME Egyptian Arabic Transcripts LDC97T19, CALLHOME
Egyptian Arabic Speech Supplement LDC2002S37 and CALLHOME Egyptian Arabic Transcripts
Supplement LDC2002T38. The lexicon also contains entries derived manually from the
Badawi & Hines dictionary of Colloquial Egyptian Arabic. The lexical entries are written
one to a line with tab-separated fields, including orthographic representation in
both the LDC romanization as well as Arabic script, morphological, phonological, stress,
source, and frequency information for each word. Relative to
earlier versions of the Arabic Pronouncing Lexicon, this release provides not only
a significant increase in the number of entries, but also a significant effort to
improve the quality and consistency of all entries.
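Reading a one-entry-per-line, tab-separated lexicon of this kind might look like the sketch below. The field order, the field names in `LexiconEntry`, and the sample line are all assumptions made for illustration; the actual column layout should be taken from the corpus documentation.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    # Hypothetical field order; check the corpus documentation for the real layout.
    romanized: str
    arabic_script: str
    morphology: str
    pronunciation: str
    stress: str
    source: str
    frequency: str


def parse_line(line):
    """Split one tab-separated lexicon line into a LexiconEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LexiconEntry._fields):
        raise ValueError(f"expected {len(LexiconEntry._fields)} fields, got {len(fields)}")
    return LexiconEntry(*fields)


# Placeholder values only, not real lexicon data.
entry = parse_line("ROM\tSCRIPT\tMORPH\tPHON\tSTRESS\tSRC\t3\n")
print(entry.romanized, entry.frequency)
```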
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kilany, Hanaa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gadalla, H.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Arram, Howaida
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yacoub, A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
El-Habashi, Alaa
ADDED ENTRY--PERSONAL NAME
- Personal name:
McLemore, C.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99L22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631566
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99L23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 238-033-984-489-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
American English Spoken Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99L23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This lexicon contains pronunciations captured in individual audio files for 53,602
of the most common words in English. *Data* 50,892 words were chosen from LDC's CALLHOME
American English Lexicon on the basis of their frequency in the data that were used
in creating the 1994 CSR Language Model Text Corpus ("CSR-III Text Corpus," LDC95T6).
The sources for the language model include Wall Street Journal (1987-1994), Associated
Press (1989-1991), and San Jose Mercury News (1991); all taken from the three CD-ROM
volumes of TIPSTER (LDC93T3A). To extend the coverage of common words that happen
not to occur in the LDC corpora sampled, an additional 2,922 words (i.e., compounds,
companies, places, languages, and numerals) were added from other sources. Each word
was read by the speaker in a quiet recording studio, using a Sennheiser HMD 410 microphone
and a Sony DAT recorder. The recordings were downsampled to 16 kHz for storage on disk
with the individual lexical utterances segmented into separate waveform files, with
a consistent margin of silence on both sides of each word. The CD-ROMs were created
using the ISO-9660 Level 2 data format, along with Rock Ridge extensions. All common
computer operating systems should be able to read the full-length file names. The
corpus has since been converted to a file for web download.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Seidl-Friedman, Amanda Hallie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kobayashi, Masato
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99L23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631450
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S78
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 651-984-335-857-7
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S78
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Speech Under Simulated and Actual Stress (SUSAS) was created by the Robust Speech
Processing Laboratory at the University of Colorado-Boulder under the direction of
Professor John H. L. Hansen and sponsored by the Air Force Research Laboratory. *Data*
The database is partitioned into four domains, encompassing a wide variety of stresses
and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from
22 to 76 years were employed to generate in excess of 16,000 utterances. SUSAS also
contains several longer speech files from four Apache helicopter pilots. Those helicopter
speech files were transcribed by the Linguistic Data Consortium and are available
in SUSAS Transcripts (LDC99T33). A common highly confusable vocabulary set of 35 aircraft
communication words makes up the database. All speech tokens were sampled using a 16-bit
A/D converter at a sample rate of 8 kHz.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hansen, John H.L.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S78
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631442
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S79
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 777-596-800-424-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Switchboard-2 Phase II
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S79
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SWB-2 Phase II consists of 4,472 five-minute telephone conversations involving 679
participants. This corpus was collected by the Linguistic Data Consortium (LDC) in
support of a project on Speaker Recognition sponsored by the U.S. Department of Defense.
*Data* Participants in SWB-2 Phase II were recruited from the following midwestern
college campuses: Iowa State University, Michigan State University, University of
Michigan, University of Minnesota, University of Wisconsin at Madison, Northwestern
University, and Ohio State University. Solicitation methods included the Internet,
newspaper advertisements and personal contacts. The majority of the participants resided
in Minnesota, Wisconsin, Ohio, Iowa, Michigan and Illinois, as follows: Minnesota,
156 speakers; Wisconsin, 105 speakers; Ohio, 70 speakers; Iowa, 64 speakers; Michigan,
41 speakers; Illinois, 37 speakers. Each recruit was asked to participate in at
least ten five-minute phone calls. Ideally each participant would receive five calls
at a designated number and make five calls from phones with different automatic number identification (ANI) codes.
Participants were asked to discuss a specific topic (read by the automated operator)
and not to provide personal information during their call. Each of the 679 participants
placed their calls via a toll-free robot operator maintained by LDC. Access to the
robot operator was possible via a unique Personal Identification Number (PIN) issued
by the recruiting staff at LDC when the caller enrolled in the project. Upon conclusion
of the study all calls were audited by LDC staff members. Particular attention was
paid to PIN verification (matching speaker with PIN), checking call duration, and
call quality. Upon completion of this process, checks were issued and mailed to participants.
The conversations have not been transcribed.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S79
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631426
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S80
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 095-881-879-489-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1997 Speaker Recognition Benchmark
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S80
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1997 speaker recognition evaluation was part of an ongoing series of yearly evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of text-independent
speaker recognition. To this end the evaluation was designed to be simple, to focus
on core technology issues, to be fully supported, and to be accessible. *Data* Technical
Objectives of the 1997 speaker recognition evaluation were: 1. Exploring promising
new ideas in speaker recognition 2. Developing advanced technology incorporating these
ideas 3. Measuring the performance of this technology The evaluation data was drawn
from the Switchboard-2 Phase 1 corpus. Both training and test segments were constructed
by concatenating consecutive turns for the desired speaker, similar to what was done
in 1996. Each segment is stored as a continuous speech signal in a separate SPHERE
file. The speech data is stored in 8-bit mulaw format.
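For readers unfamiliar with the encoding, 8-bit mu-law samples can be expanded to 16-bit linear values as sketched below. The sketch follows the standard G.711 mu-law expansion and does not parse the SPHERE header; the function name `ulaw_to_linear` is hypothetical and the code is not part of the corpus distribution.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    byte = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = byte & 0x80               # top bit carries the sign
    exponent = (byte >> 4) & 0x07    # next three bits select the segment
    mantissa = byte & 0x0F           # low four bits give the step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(data):
    """Decode a bytes object of mu-law samples into a list of integers."""
    return [ulaw_to_linear(b) for b in data]


print(decode(bytes([0xFF, 0x80, 0x00])))
```

The bias constant 0x84 (132) is added before the shift and removed afterwards, which is what makes the code 0xFF decode to silence.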
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S80
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631523
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S81
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 282-712-829-978-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1999 Speaker Recognition Benchmark
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S81
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 1999 speaker recognition evaluation is part of an ongoing series of yearly evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of text-independent
speaker recognition. *Data* Technical Objectives of the 1999 speaker recognition evaluation
were: 1. Exploring promising new ideas in speaker recognition 2. Developing advanced
technology incorporating these ideas 3. Measuring the performance of this technology
The evaluation data was drawn from the Switchboard-2 Phase 3 corpus. Both training
and test segments were constructed by concatenating consecutive turns for the desired
speaker, similar to what was done in 1996. Each segment is stored as a continuous
speech signal in a separate SPHERE file. The speech data is stored in 8-bit mulaw
format.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S81
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631507
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S82
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 904-484-855-805-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
USC Marketplace Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S82
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The USC Marketplace Broadcast News Corpus contains approximately 40 hours of audio
data, which was recorded daily between May 1, 1996 and September 18, 1996. Corresponding
transcript files were created by Federal Document Clearing House and enhanced by the
LDC to include: story boundaries, disfluency markers, and speaker and gender identification.
In keeping with HUB4 style transcription conventions, LDC spelled all digit strings
in standard orthography. Commercial and music segments, while a part of the audio
publication, were excluded from the transcripts. The timestamps mark the beginning
of each speaker turn relative to the beginning of the recording and are precise to
the 100th of a second. Although the transcripts were created using HUB4 conventions,
the second and third pass quality checks, typically required by government sponsored
evaluation projects, were skipped. *Data* The USC Marketplace recordings from the
summer of 1996 were received on digital audio tapes (DATs) from the University of
Southern California. LDC excluded from this set the roughly seven hours of broadcast
that are currently included in the 1996 English Broadcast News publication. Marketplace
is produced by USC Radio in Los Angeles, a division of the University of Southern
California.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgovsky, Paul
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S82
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S83
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 389-320-759-767-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Tactical Speaker Identification Speech Corpus (TSID)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S83
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Tactical Speaker Identification Corpus (TSID), which was collected by Douglas
Reynolds and Gerald C. O'Leary of MIT Lincoln Labs, contains recordings of 35 speakers
(four female, 31 male), using a variety of different radio transmitters and receivers.
*Data* The recording sessions were conducted by assembling the speakers into seven
groups of five, then having each speaker perform the following tasks: - read a list
of TIMIT sentences - read a list of digit strings - give directions for traveling
from one point to another using a map (unscripted map task) Each speaker performed
this set of tasks on each of three transmitters (xmtr1-3), and the utterances were
recorded simultaneously on DAT recorders attached to each of six receivers (rcvr1-6),
which were located at some distance (well out of earshot) from the transmitter. Recordings
were also made at the same time on a DAT recorder near the speaker using a head-mounted
microphone to provide a reference wide-band recording of the speech (refwb). As a
result, the corpus is organized along four dimensions: speaker, transmitter, receiver,
and speaking task; this organization can be viewed as a four-dimensional matrix, with
35x3x7x3 cells. Due to some occasional mishaps and malfunctions during the collection,
some cells in this matrix are either empty or only partially full. In addition to
the tasks listed above, three pairs of speakers also participated in a two-way map
task using xmtr3; in this case, one of the speakers in the task gives directions to
the other for tracing a route on a map, and both speakers are recorded on a single
audio channel at each of the receivers (except for the "refwb" recording: the two
speakers were separated by some distance, using radio communication to perform the
task, and only one of them used a head-mounted microphone and local DAT recorder for
wide-band recording).
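The four-dimensional organization described in the summary can be sketched as a simple enumeration. This is only an illustration in Python and is not part of the corpus distribution: the speaker labels and task names are hypothetical placeholders, while the xmtr1-3, rcvr1-6, and refwb labels come from the summary, and the seventh receiver slot is assumed to be the refwb reference recording.

```python
from itertools import product

# Hypothetical speaker labels; the corpus's own speaker IDs are not given here.
speakers = [f"spk{n:02d}" for n in range(1, 36)]            # 35 speakers
transmitters = ["xmtr1", "xmtr2", "xmtr3"]                  # 3 transmitters
# Six radio receivers plus the wide-band reference recording (assumed 7th slot).
receivers = [f"rcvr{n}" for n in range(1, 7)] + ["refwb"]   # 7 recording channels
# Hypothetical names for the three speaking tasks.
tasks = ["timit_sentences", "digit_strings", "map_task"]

# Every (speaker, transmitter, receiver, task) combination is one cell.
cells = list(product(speakers, transmitters, receivers, tasks))
print(len(cells))  # 35 * 3 * 7 * 3 = 2205
```

Because some cells are empty or only partially full, a real index over the corpus would likely map each cell to a possibly missing file rather than assume all 2,205 exist.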
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reynolds, Douglas
ADDED ENTRY--PERSONAL NAME
- Personal name:
O'Leary, Gerald C.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S83
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633917
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S37
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 331-222-724-302-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Heroico Spanish Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S37
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on West Point Heroico Spanish Speech, Linguistic
Data Consortium (LDC) catalog number LDC2006S37 and ISBN 1-58563-391-7. West Point
Heroico Spanish Speech is a database of digital recordings of spoken Spanish. It was
designed and collected by staff and faculty of the Department of Foreign Languages
(DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic
models for speech recognition systems. The U.S. government uses these systems to provide
speech-recognition enhanced language learning courseware to government linguists and
students enrolled in various government language programs. Additionally, parts of
this corpus were designed to model question/answer dialogues for use in domain-specific
speech-to-speech translation systems. The corpus consists of two subcorpora, one collected
in September 2001 at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy
in Mexico City, and the other at USMA at different times since 1997. The USMA subcorpus
includes data from non-native speakers and data collected through a throat microphone.
*Data* Two kinds of prompt scripts were used, one to elicit read speech and one for
free-response answers to questions. The read speech prompts are also divided into
two groups, one designed to elicit speech typical of language learning scenarios and
the other for speech from educated native speakers. The scripts used to record read
speech have a total of 724 distinct sentences. This number includes 205 short, simple
sentences used in typical language learning scenarios. The other 519 sentences were
extracted from lecture notes used at USMA in a military readings course. All of the
read speech prompts are listed in two files in the transcripts directory:
HEROICO-Recordings.txt and USMA-prompts.txt, containing the sentences read by informants at
the Mexican Military Academy and USMA, respectively. Each line of these files has
two fields separated by a tab, the first denoting the base name of the waveform file,
and the second the prompt used in recording the utterance. The read speech data collected
from informants at HEROICO are stored in the HEROICO/Recordings Spanish directory.
The script used to elicit free-response answers contains 143 questions. The text that
was actually presented to the informants is in the file named questions.txt in the
transcripts directory. Data recorded from these prompts are stored in the HEROICO/Answers
Spanish directory. The human-performed transcriptions of the informants' answers are
listed in the HEROICO-Answers.txt file in the transcripts directory. Again, each line
of this file has two fields separated by a tab. The first field contains two numbers
separated by a slash. The first number is an identification index for the speaker.
The second number is an index to the question. The second field on the line contains
a word-level transcription of the informant's answer to the question indexed by the
second number in the first field. So, for example, in the line "100/10 no ella no tiene
barba ni bigote", "no ella no tiene barba ni bigote" is a transcription of the response
speaker 100 gave to question 10. The corresponding waveform is stored in the
file 10.wav in the directory HEROICO/Answers Spanish/100. Each speaker in the HEROICO
subcorpus attempted to record 100 utterances by reading 75 sentences and giving
25 free-response answers to questions. Both native and non-native USMA informants
read from the list of 205 simple sentences. The prompts used in the USMA subcorpus
are listed in the file USMA-prompts.txt in the transcripts directory. This file has
the same two-field format as the above transcription files. Some of the USMA informants
wore an additional throat microphone. That data was recorded in a separate stream
and stored in files whose names begin with the letter t. Data collected at USMA are
stored under the USMA directory. The names of the directories under the USMA directory
indicate whether the speaker was native or non-native. The speaker's native country
is also indicated in the case of native speakers. Speech data was collected at HEROICO
using Pentium 450 MHz laptop computers running Windows 2000 with a 16-bit data size
and sampling rate of 22,050 Hz. The recording script presented a visual display of
the sentence to be recorded. The informant pressed a key and spoke the sentence. The
recording was played back for review, allowing the utterance to be re-recorded. A
member of the data collection team was on hand during the recording session to verify
recordings and provide technical assistance in case of malfunctioning equipment. The
data from USMA was collected using several different microphones and formats. Most
of the data were recorded on Pentium computers running Linux through an m-10 Shure
head-mounted microphone. Entropics ESPS programs were used in most cases, especially
when both head-mounted and throat microphones were used.
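The two-field format of HEROICO-Answers.txt described in the summary can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not a tool shipped with the corpus: the names Answer, parse_answer_line, and answer_wav_path are hypothetical, and the directory layout HEROICO/Answers Spanish/<speaker>/<question>.wav is inferred from the summary's example rather than checked against the distribution.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Answer(NamedTuple):
    speaker: int      # identification index for the speaker
    question: int     # index of the question that was answered
    transcript: str   # word-level transcription of the answer


def parse_answer_line(line: str) -> Answer:
    """Split one 'speaker/question<TAB>transcript' line into its parts."""
    key, transcript = line.rstrip("\n").split("\t", 1)
    speaker, question = key.split("/")
    return Answer(int(speaker), int(question), transcript.strip())


def answer_wav_path(answer: Answer) -> PurePosixPath:
    # Assumed layout: HEROICO/Answers Spanish/<speaker>/<question>.wav
    return PurePosixPath(
        "HEROICO", "Answers Spanish", str(answer.speaker), f"{answer.question}.wav"
    )


example = parse_answer_line("100/10\tno ella no tiene barba ni bigote")
print(example)
print(answer_wav_path(example))
```

Splitting on the first tab only keeps any later tabs inside the transcript field instead of silently dropping text.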
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Spoken Spanish
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, John
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S37
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99S84
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 350-408-401-790-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT2 English Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99S84
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically
related material in streams of data such as newswire and broadcast news. The TDT2
corpus was created to support three TDT2 tasks: find topically homogeneous sections
(segmentation), detect the occurrence of new events (detection), and track the reoccurrence
of old or new events (tracking). *Data* The TDT2 Audio Corpus contains a total of
1,036 waveform files. Each file is a complete single-channel recording of a 30- or 60-minute
broadcast, which has been digitized at a sample rate of 16 kHz using 16-bit samples.
The four broadcast sources represented in the corpus, with their format and programming
frequency, are as follows: ABC World News Tonight -- "traditional" network news, 30
minutes/day; CNN Headline News -- continuous news summaries, up to four 30-minute samples/day;
PRI The World -- "in-depth" radio news, 60 minutes/weekday; VOA -- varied 60-minute
news programs, up to 2/day.
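As a rough worked example, the stated digitization parameters imply the following amount of raw sample data per broadcast. This is a back-of-the-envelope sketch in Python: it assumes uncompressed samples and ignores any file header or compression used in the actual distribution, and the function name raw_audio_bytes is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sample rate, from the summary
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings


def raw_audio_bytes(minutes: int) -> int:
    """Raw sample data for a recording of the given length, in bytes."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * minutes * 60


for minutes in (30, 60):
    print(minutes, raw_audio_bytes(minutes))
# 30 -> 57,600,000 bytes; 60 -> 115,200,000 bytes
```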
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99S84
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631469
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T33
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 747-562-690-531-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SUSAS Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T33
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SUSAS (Speech Under Simulated and Actual Stress) Transcripts was developed by the
Linguistic Data Consortium and consists of transcribed English speech by helicopter
pilots. The speech data in this release is a subset of the data in SUSAS (LDC99S78),
created by the Robust Speech Processing Laboratory at the University of Colorado-Boulder
under the direction of Professor John H. L. Hansen and sponsored by the Air Force
Research Laboratory. *Data* The transcripts in this release cover several speech files
in the SUSAS collection, specifically, speech from four Apache helicopter pilots.
The SUSAS speech database is partitioned into four domains, encompassing a wide variety
of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging
from 22 to 76 years, were employed to generate in excess of 16,000 utterances. A common
highly confusable vocabulary set of 35 aircraft communication words makes up the database.
All speech tokens were sampled using a 16-bit A/D converter at a sample rate of 8 kHz.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hansen, John H.L.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T33
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631434
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T34
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 768-601-383-003-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Japanese Business News Text Supplement
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T34
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus consists of newswire text from Nihon Keizai Shimbun, Inc. (NIKKEI), the
largest Japanese daily financial newspaper, and Telerate, Inc. (formerly known as
Dow Jones/Kyodo News Service), published primarily for managers of Japanese-owned
corporations or Japanese employees working in North American financial institutions.
The Telerate portion constitutes all newswire text collected by the LDC between December
1994 and September 1998. The Telerate data collected from June 1995 to September 1998
serves as a supplement to the original publication. All NIKKEI data was collected
from December 1993 to November 1994 and is also available on the 1995 release of the
Japanese Business News Text. The data, including SGML tags, breaks down as follows:

Source     # of Files   Daily Average Size   Total Size
--------------------------------------------------------
NIKKEI     364          514K                 188MB
Telerate   1060         336K                 357MB

The NIKKEI text was received on nine-track
magnetic tape. The original character encoding was EBCDIC, but was converted to EUC
encoding, which the LDC uses for its Japanese publications. The Telerate text was
received via a digital transmission service installed at the LDC by Telerate. Custom
software was written by the LDC to poll a central database and download articles individually.
The character encoding is EUC. LDC added SGML tags automatically in order to identify
individual stories within the daily collections.
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kobayashi, Masato
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T34
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631515
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T36
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 380-096-969-630-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
USC Marketplace Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T36
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The USC Marketplace Broadcast News Corpus contains approximately 40 hours of audio
data, which was recorded daily between May 1, 1996 and September 18, 1996. Corresponding
transcript files were created by Federal Document Clearing House and enhanced by the
LDC to include: story boundaries, disfluency markers, and speaker and gender identification.
In keeping with HUB4-style transcription conventions, LDC spelled out all digit strings
in standard orthography. Commercial and music segments, while a part of the audio
publication, were excluded from the transcripts. The timestamps mark the beginning
of each speaker turn relative to the beginning of the recording and are precise to
the 100th of a second. Although the transcripts were created using HUB4 conventions,
the second- and third-pass quality checks, typically required by government-sponsored
evaluation projects, were skipped. *Data* The USC Marketplace recordings from the
summer of 1996 were received on digital audio tapes (DATs) from the University of
Southern California. LDC excluded from this set the roughly seven hours of broadcast
that are currently included in the 1996 English Broadcast News publication. Marketplace
is produced by USC Radio in Los Angeles, a division of the University of Southern
California.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
- General subdivision:
Language
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T36
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u por d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631604
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T40
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 632-462-858-283-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Portuguese Newswire Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T40
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus builds on the Portuguese data published previously in the European Language
Newswire Text Corpus and contains the previously published material, as well as more
recent material. *Data* The data in this corpus comes from Agence France Presse from
May 13, 1994 through December 31, 1998 (June 27, 1996 - December 31, 1998 was previously
unpublished by the LDC). The data has been tagged using SGML to identify article boundaries.
LANGUAGE NOTE
- Language note:
Content in Portuguese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T40
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631620
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T41
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 581-480-117-182-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Spanish Newswire Text, Volume 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T41
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of Spanish newswire contains data from the following sources: * Agence
France Presse (January 13, 1996--December 13, 1998) * Associated Press Worldstream
(December 1, 1995--August 31, 1998) * El Norte (January 1, 1997--December 31, 1998)
*Data* The consistent format chosen for release consists of SGML tagging and the ISO-8859-1
(Latin1) 8-bit character set. Our general strategy for SGML tagging is as follows:
All document units (articles) are bounded by the tags DOC and /DOC, and within these
units, the text content of each article is bounded by TEXT and /TEXT. Following each
DOC tag is a DOCID tag that provides a unique identifying string for that article.
Other tags within the DOC unit (but external to TEXT) provide additional information
that was received with the article (e.g. headline, dateline, byline, keywords, etc.),
but the inventory and nature of additional information varies from one source to the
next (and in some cases, from one article to the next), and this variability is reflected
in the SGML tags that are used to preserve the information. Within the TEXT units,
tagging is kept to a minimum, typically consisting only of paragraph tags.
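The tagging scheme described in the summary lends itself to a small extraction routine. The following Python sketch is illustrative only: it assumes the tags appear in angle brackets (for example <DOC> and </DOC>), that DOCID has a matching closing tag, and that paragraphs use a <P> tag. The sample text and its EXAMPLE-0001 style identifiers are made up, and the function name iter_articles is hypothetical.

```python
import re

# Made-up sample in the layout the summary describes; real DOCID strings,
# header tags, and paragraph tags vary from one source to the next.
SAMPLE = """<DOC>
<DOCID> EXAMPLE-0001 </DOCID>
<HEADLINE> Titular de ejemplo </HEADLINE>
<TEXT>
<P> Primer párrafo. </P>
<P> Segundo párrafo. </P>
</TEXT>
</DOC>
<DOC>
<DOCID> EXAMPLE-0002 </DOCID>
<TEXT>
<P> Otro artículo. </P>
</TEXT>
</DOC>
"""

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCID_RE = re.compile(r"<DOCID>\s*(.*?)\s*</DOCID>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def iter_articles(sgml: str):
    """Yield (docid, plain_text) for each DOC unit in the input string."""
    for doc in DOC_RE.finditer(sgml):
        body = doc.group(1)
        docid = DOCID_RE.search(body)
        text = TEXT_RE.search(body)
        if docid is None or text is None:
            continue  # skip malformed units rather than guess
        # Drop the remaining tags inside TEXT and collapse whitespace.
        plain = " ".join(TAG_RE.sub(" ", text.group(1)).split())
        yield docid.group(1), plain


articles = list(iter_articles(SAMPLE))
for docid, plain in articles:
    print(docid, "|", plain)
```

Since the release uses the ISO-8859-1 character set, a real file would be opened with encoding="latin-1" before being passed to a routine like this.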
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gallegos, Gustavo
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T41
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1999 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585631639
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC99T42
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 141-282-691-413-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Treebank-3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1999]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC99T42
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release contains the following Treebank-2 Material: * One million words of 1989
Wall Street Journal material annotated in Treebank II style. * A small sample of ATIS-3
material annotated in Treebank II style. * A fully tagged version of the Brown Corpus;
and the following new material: * Switchboard tagged, dysfluency-annotated, and parsed
text * Brown parsed text. The Treebank bracketing style is designed to allow the extraction
of simple predicate/argument structure. Over one million words of text are provided
with this bracketing applied. *Data* The Penn Treebank (PTB) project selected 2,499
stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for
syntactic annotation. These 2,499 stories have been distributed in both Treebank-2
(LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the
raw text for each story. Three "map" files are available in a compressed file (pennTB_tipster_wsj_map.tar.gz)
as an additional download for users who have licensed Treebank-2 and provide the relation
between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
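The summary above says the bracketing style is designed to allow extraction of simple predicate/argument structure. The following is a minimal sketch of that idea, not the Treebank project's own tooling. It assumes a clause of the form S -> NP-SBJ VP, and the example sentence and the helper names (parse_bracketed, leaves, subject_and_predicate) are hypothetical illustrations rather than material from the corpus.

```python
def parse_bracketed(text):
    """Parse one bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # an outer wrapper such as "( (S ...))" has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the terminal words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


def subject_and_predicate(tree):
    """Return (subject words, verb) for a simple S -> NP-SBJ VP clause."""
    _, children = tree
    subj = next(c for c in children if isinstance(c, tuple) and c[0] == "NP-SBJ")
    vp = next(c for c in children if isinstance(c, tuple) and c[0] == "VP")
    verb = next(c for c in vp[1] if isinstance(c, tuple) and c[0].startswith("VB"))
    return leaves(subj), leaves(verb)[0]


# Hypothetical example sentence, not taken from the WSJ material.
EXAMPLE = "(S (NP-SBJ (DT The) (NN board)) (VP (VBD approved) (NP (DT the) (NN plan))))"
tree = parse_bracketed(EXAMPLE)
print(subject_and_predicate(tree))
```

Real Treebank files also contain null elements, traces and many more function tags, which this sketch does not interpret.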
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell P.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Santorini, Beatrice
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcinkiewicz, Mary Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC99T42
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633925
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 108-265-441-199-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Foreign Accented English Release 1.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on CSLU: Foreign Accented English Release 1.2, Linguistic
Data Consortium (LDC) catalog number LDC2007S08 and ISBN 1-58563-392-5. CSLU: Foreign
Accented English Release 1.2 consists of continuous speech in English by native speakers
of 22 different languages: Arabic, Cantonese, Czech, Farsi, French, German, Hindi,
Hungarian, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Malay, Polish,
Portuguese (Brazilian and Iberian), Russian, Swedish, Spanish, Swahili, Tamil and
Vietnamese. The corpus contains 4925 telephone-quality utterances, information about
the speakers' linguistic backgrounds and perceptual judgments about the accents in
the utterances. The speakers were asked to speak about themselves in English for 20
seconds. Three native speakers of American English independently listened to each
utterance and judged the speakers' accents on a 4-point scale: negligible/no accent,
mild accent, strong accent and very strong accent. This corpus is intended to support
the study of the underlying characteristics of foreign accent and to enable research,
development and evaluation of algorithms for the identification and understanding
of accented speech. Some of the files in this corpus are also contained in CSLU: 22
Languages Corpus, LDC2005S26.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Accents and accentuation
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Variation
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 910-955-859-747-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Yes/No Version 1.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for CSLU: Yes/No Version 1.2, Linguistic Data Consortium
(LDC) catalog number LDC2007S05 and ISBN 1-58563-445-X. CSLU: Yes/No Version 1.2 is
a collection of answers to yes/no questions from various telephone speech corpora
created by the Center for Spoken Language Understanding, Oregon Health and Science
University (CSLU). The corpus contains approximately 20,000 examples of roughly 18,000
speakers saying "yes" or "no" in response to various questions. Each speech file in
the corpus has a corresponding orthographic transcription following the CSLU Labeling
Conventions. In cases where a transcription did not already exist, the utterance was
run through a speech recognizer to automatically obtain the transcription. The data
were collected from both analog and digital phone lines. The analog data were recorded
using a Gradient Technologies analog-to-digital conversion box. These files were recorded
as 16-bit, 8 kHz and stored in a linear format. The digital data were recorded with
the CSLU T1 digital data collection system. These files were sampled at 8 kHz, 8-bit
and stored as ulaw files. All of the data use the RIFF standard file format. This
file format is 16-bit linearly encoded.
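The summary above mentions both 8-bit ulaw storage and 16-bit linear encoding. As a rough illustration of how the two relate, the sketch below expands 8-bit samples to 16-bit linear values. It assumes that "ulaw" means standard G.711 mu-law companding. The function names are hypothetical, and this is not a description of how CSLU converted its files.

```python
import struct


def ulaw_to_linear(byte):
    """Expand one G.711 mu-law byte to a signed linear sample."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # 0x84 (132) is the bias added during mu-law compression.
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_ulaw(data):
    """Decode a bytes object of mu-law samples to a list of ints."""
    return [ulaw_to_linear(b) for b in data]


def to_pcm16(samples):
    """Pack linear samples as little-endian 16-bit PCM bytes."""
    return struct.pack("<%dh" % len(samples), *samples)


print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```

The decoded values always fit in 16 bits, so they can be packed as 16-bit PCM without clipping.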
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Noel, Mike
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634425
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 378-255-081-778-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Mandarin Affective Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Mandarin Affective Speech is a database of emotional speech consisting of audio recordings
and corresponding transcripts collected in 2005 at the Advanced Computing and System
Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou,
People's Republic of China. This corpus was designed with two goals: first, to serve
as a tool for linguistic and prosodic feature investigation of emotional expression
in Mandarin Chinese; and second, to provide a source of training and test data essential
to support research in speaker recognition with affective speech. The speech database
was recorded by eliciting speakers to express different emotional states in response
to stimuli. The speakers read scenarios designed to elicit an emotional response,
such as a colleague's mistake for anger, a pleasant trip for elation, a hurry-up scene
for panic and a puppy's death for sadness. The five emotional states recorded are
characterized as follows: * Neutral - Simple statements without any emotion. * Anger
- A strong feeling of displeasure or hostility. * Elation - Be glad or happy because
of praise. * Panic - A sudden, overpowering terror, often affecting many people at
once. * Sadness - Affected or characterized by sorrow or unhappiness. *Data* Over 100
speakers participated in the data collection. After screening, recordings from 68
speakers (23 females, 45 males) were used in this corpus. Most of the speakers were
in their twenties at the time of collection. Information about the speakers is contained
in "SpeakerInfo.doc." Subjects were given a text to read that consisted of five phrases,
twenty sentences and two paragraphs designed to generate the emotional speech. The
material included all the phonemes in Mandarin. Each subject read the phrases, paragraphs,
and sentences portraying the five emotional states: neutral (unemotional), anger,
elation, panic and sadness. Altogether this database contains 25,636 utterances. The
read material was constructed as follows: * 5 phrases - "yes", "no" and three nouns
("apple", "train" and "tennis ball"). In Chinese, these words contain many different
basic vowels and consonants. * 20 sentences - These sentences include all the phonemes
and most common consonant clusters in Mandarin. The types of sentences are: simple
statements, a declarative sentence with an enumeration, general questions (yes/no
question), alternative questions, imperative sentences, exclamatory sentences, special
questions (wh-questions). * 2 paragraphs - Two readings, one selected from a famous
Chinese novel, and the other stating a normal fact. All the data were recorded in
a quiet office on an OLYMPUS DM-20 digital voice recorder with a sampling rate of
22,050 Hz. Afterwards, the recorded voice files were transferred to a personal computer
by USB (Universal Serial Bus). The recordings were then converted into monophonic
Windows PCM format at 8 kHz sampling frequency and 16-bit resolution. Further information
about the data and methodology in this corpus is contained in the authors' paper,
"MASC: A Speech Corpus in Mandarin for Emotional Analysis and Affective Speaker Recognition,"
in "MASC.pdf."
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yang, Yingchun
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wu, Zhaohui
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Dongdong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634468
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 951-213-258-921-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2003 NIST Rich Transcription Evaluation Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2003 NIST Rich Transcription Evaluation Data contains the test material used in the
2003 Rich Transcription Spring and Fall evaluations administered by the NIST (National
Institute of Standards and Technology) Speech Group. The Spring evaluation (RT-03S),
implemented in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast
news speech and conversational telephone speech in three languages: English, Mandarin
Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task,
speaker diarization for broadcast news speech and conversational telephone speech
in English. The Fall evaluation (RT-03F), implemented in October 2003, focused on
MDE tasks including speaker diarization, speaker-attributed STT, SU (sentence/semantic
unit) detection and disfluency detection for broadcast news speech and conversational
telephone speech in English. For complete information about the evaluations, see the
RT-03 Spring Evaluation Website and the RT-03 Fall Evaluation Website. *Data* The
BN datasets were selected from TDT-4 sources collected in February 2001. The evaluation
excerpts were transcribed to the nearest story boundary. The English BN dataset is
approximately three hours long and is composed of 30-minute excerpts from six different
broadcasts. The Mandarin Chinese BN dataset is approximately one hour long, consisting
of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also
approximately one hour long and contains 30-minute excerpts from two different broadcasts.
The CTS datasets consist of material from various LDC telephone speech data. All evaluation
excerpts were transcribed to the nearest turn. The English CTS set is approximately
6 hours long and is composed of 5-minute excerpts from 72 different conversations:
36 from the Switchboard Cellular collection and 36 from the Fisher collection. The
Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute
excerpts from 12 different conversations from the CallFriend Mandarin Chinese data.
The Arabic CTS set is also approximately one hour long and contains 5-minute excerpts
from 12 different conversations from the CallHome Egyptian Arabic data. No manual
(human-annotated) segmentations were provided. Sites were required to generate their
own segmentations automatically. Unlike the BN audio files, for which the full broadcasts
were provided, the CTS audio files contain only the evaluation excerpts. Each audio
excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file.
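The excerpt format described above can be read with a few lines of code. The sketch below splits a NIST SPHERE file into its header fields and sample bytes, then separates the interleaved channels. It assumes an uncompressed file with one byte per sample. The function names are hypothetical, and the header fields found in the actual release files may differ from those a reader expects.

```python
def split_sphere(blob):
    """Split SPHERE file bytes into (header fields, raw sample bytes)."""
    if blob[:8] != b"NIST_1A\n":
        raise ValueError("not a SPHERE file")
    # The second 8-byte line gives the total header size in bytes.
    header_size = int(blob[8:16].decode("ascii").strip())
    fields = {}
    text = blob[16:header_size].decode("ascii", errors="replace")
    for line in text.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) == 3:
            name, type_flag, value = parts
            fields[name] = int(value) if type_flag == "-i" else value
    return fields, blob[header_size:]


def deinterleave(samples, channels=2):
    """Separate interleaved one-byte samples into one bytes object per channel."""
    return [samples[c::channels] for c in range(channels)]
```

Each returned channel still holds mu-law bytes, so a decoding step is needed before the samples can be treated as linear audio.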
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Egyptian Arabic, Standard Arabic, and Mandarin Chinese. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doddington, George R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sanders, Greg
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634492
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 686-386-828-766-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Nationwide Speech Project
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus represents part of the work of the Nationwide Speech Project (NSP) conducted
by the authors at Indiana University. The purpose of the NSP was to collect a large
amount of speech produced by male and female talkers representing the primary regional
varieties of American English: New England, Mid-Atlantic, North, Midland, South and
West. This release contains approximately 60 hours of speech, or nearly one hour of
speech from each of 60 white American English speakers (five male and five
female talkers from each of the six dialect regions) reading words and sentences. The corpus
can be used for perceptual and acoustic experiments designed to explore the role of
variation in spoken language processing. Such applications include speech science
experiments and sociolinguistic or sociophonetic research. *Data* The speakers were
recruited from the Indiana University community; they were all 18-25 years old at
the time of recording, had lived exclusively in one region prior to age 18, and both
parents of each speaker were also raised in the same region. Further demographic information
about the speakers is provided in the file talkers.txt. The materials include 102
high predictability sentences and five repetitions of each of 10 hVd words. The high
predictability sentences are 5-8 words in length and the final word in each sentence
is highly predictable based on the preceding semantic context. The 10 hVd words are:
heed, hid, hayed, head, had, hod, hud, hoed, hood and who'd. Participants were recorded
one at a time by an experimenter in a sound-attenuated booth (IAC Audiometric Testing
Room, Model 402). Both the experimenter and the participant sat in the sound booth
during testing. During the recording session, the participant was seated in front
of a ViewSonic LCD flatscreen monitor (ViewPanel VG151) which mirrored the screen
of a Macintosh Powerbook G3 laptop. The participant wore a Shure head-mounted microphone
(SM10A) that was positioned approximately one inch from the left corner of the talker's
mouth. The microphone output was fed to an Applied Research Technology microphone
tube pre-amplifier. The output gain on the pre-amplifier was adjusted by the experimenter
while the participant read the Grandfather Passage as a warm-up before recording began.
The output of the microphone pre-amplifier was connected to a Roland UA-30 USB audio
interface which digitized the signal and transmitted it via USB ports to the laptop
where each utterance was recorded in an individual AIFF 16-bit digital sound file
at a sampling rate of 44.1 kHz (converted to .wav format files for this release). The
experimenter held the laptop on her lap and wore headphones connected to the Roland
device so that she could hear the same audio signal that was input to the laptop
for recording.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Clopper, Cynthia G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pisoni, David B.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633410
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 661-115-390-052-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Arabic Treebank: Part 3 (full corpus) v 2.0
(MPG + Syntactic Analysis), Linguistic Data Consortium (LDC) catalog number LDC2005T20
and ISBN 1-58563-341-0. The goal of the Arabic Treebank project is to support the
development of data-driven approaches to natural language processing (NLP), human
language technologies, automatic content extraction (topic extraction and/or grammar
extraction), cross-lingual information retrieval, information detection, and other
forms of linguistic research on Modern Standard Arabic in general. The LDC was sponsored
to develop an Arabic POS and Treebank of 1,000,000 words, and this corpus is part
three of that project. In this release, we provide both syntactic (treebank) annotation
and annotation on part of speech (POS), gloss, and word segmentation. Treebanks are
language resources that provide annotations of natural languages at various levels
of structure: at the word level, the phrase level, and the sentence level. Treebanks
have become crucially important for the development of data-driven approaches to natural
language processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information retrieval,
information detection, and other forms of linguistic research in general. This corpus
is designed for those who study and use languages either professionally or academically,
and who need text corpora in their work. The Penn Arabic Treebank is particularly
suitable for language developers, computational linguists and computer scientists
who are interested in various aspects of natural language processing. The Penn Arabic
Treebank, which was part of the DARPA TIDES project, started in the Fall of 2001 with
the objective of annotating, both manually and automatically, a large Arabic
machine-readable text corpus. As in previous Penn Treebanks, two different kinds of
information need to be produced by two different (human and computer) processes. The
Arabic Treebank project consists therefore of two distinct phases: (a) Part-of-Speech
(=POS) tagging, which divides the text into lexical tokens and gives relevant information
about each token such as lexical category, inflectional features, and a gloss (referred
to as POS for convenience, although it includes morphological and gloss information
not traditionally included with part-of-speech annotation), and (b) Arabic Treebanking
(=ArabicTB), which characterizes the constituent structures of word sequences, provides
categories for each non-terminal node, and identifies null elements, co-reference,
traces, etc. Both tasks started in November 2001 with an initial pilot consisting
of 734 files representing roughly 166K words of written Modern Standard Arabic newswire
from the Agence France Presse corpus, which has since been released as "Arabic Treebank:
Part 1 v 3.0," LDC Catalog No. LDC2005T02. The second part was released as the 168K-word
corpus "Arabic Treebank: Part 2 v 2.0," LDC Catalog No. LDC2004T02. The current Arabic
Treebank: Part 3 corpus consists of 600 stories from the An Nahar News Agency. This
corpus is also referred to as ANNAHAR. The new features include complete vocalization
of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive. The POS-only
annotation of this ANNAHAR corpus was released in 2004 under the catalog number
LDC2004T11 (Arabic Treebank: Part 3 v 1.0). In addition to the treebank annotation,
this release (i.e., Arabic Treebank: Part 3 v 2.0) also includes the POS annotation
in LDC2004T11.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mekki, Wigdan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633623
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T33
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 375-520-999-436-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BBN Pronoun Coreference and Entity Type Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T33
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the BBN Pronoun Coreference and Entity Type Corpus,
Linguistic Data Consortium (LDC) catalog number LDC2005T33 and ISBN 1-58563-362-3.
This publication supplements the one million word Penn Treebank corpus of Wall Street
Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference,
indicated by sentence and token numbers, as well as annotation of a variety of entity
and numeric types. All annotation was done by hand at BBN using proprietary annotation
tools. This corpus was developed by BBN to support the ACE and AQUAINT programs. The
corpus contains two components:
* Pronoun coreference. Stand-off annotation of pronoun coreference of the WSJ corpus
is provided in a single file. Pronouns and antecedents are indexed by sentence and
token numbers.
* Entity types. The corpus includes annotation of 12 named entity types (Person, Facility,
Organization, GPE, Location, Nationality, Product, Event, Work of Art, Law, Language,
and Contact-Info), nine nominal entity types (Person, Facility, Organization, GPE,
Product, Plant, Animal, Substance, Disease and Game), and seven numeric types (Date,
Time, Percent, Money, Quantity, Ordinal and Cardinal). Several of these types are
further divided into subtypes. Annotation for a total of 64 subtypes is provided.
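The idea of stand-off annotation indexed by sentence and token numbers can be illustrated with a minimal Python sketch. The record does not describe the actual file layout, so everything here is an assumption made for illustration: the `Span` tuple, the `span_text` helper, the 1-based numbering, and the example sentence (which is invented and not taken from the WSJ texts).

```python
from typing import List, Tuple

# Hypothetical span: (sentence number, first token, last token), inclusive.
# Whether the corpus counts from 0 or 1 is not stated in the record;
# the `base` parameter makes that assumption explicit.
Span = Tuple[int, int, int]


def span_text(sentences: List[List[str]], span: Span, base: int = 1) -> str:
    """Return the tokens that a stand-off span points at, joined by spaces."""
    sent, first, last = span
    tokens = sentences[sent - base]
    return " ".join(tokens[first - base : last - base + 1])


# Invented example sentence, already split into tokens.
sentences = [["The", "company", "said", "it", "would", "expand", "."]]

pronoun = (1, 4, 4)      # points at "it"
antecedent = (1, 1, 2)   # points at "The company"

print(span_text(sentences, pronoun), "->", span_text(sentences, antecedent))
```

Because the annotation only stores numbers, the source text stays untouched and several annotation layers can point at the same tokens.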
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brunstein, Ada
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T33
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633429
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 546-803-428-857-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Levantine Arabic QT Training Data Set 4 (Speech
+ Transcripts), Linguistic Data Consortium (LDC) catalog number LDC2005S14 and ISBN
1-58563-342-9. This release contains 901 calls and the total speech is 133.6 hours
of telephone conversation in Levantine Arabic. Both audio and transcription files
are included in this package. The majority of speakers in this corpus are Lebanese.
The data is similar to the training data in Set 3 [LDC2005S07, speech and LDC2005T03,
transcripts]. The dialects are distributed as follows:
* 171 JOR
* 1373 LEB
* 229 PAL
* 29 SYR
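As a rough check on the dialect counts listed above, the following Python sketch totals them and converts each to a share. The counts and the call total are copied from the summary. Reading the counts as one label per call side is an assumption: the record does not say so, although the total does equal twice the number of calls.

```python
# Counts copied from the summary; the per-side interpretation is an
# assumption, not something the record states.
dialect_counts = {"JOR": 171, "LEB": 1373, "PAL": 229, "SYR": 29}
calls = 901

total = sum(dialect_counts.values())

# Share of each dialect label, as a percentage rounded to one decimal place.
shares = {code: round(100 * n / total, 1) for code, n in dialect_counts.items()}

print("labels:", total, "| two per call:", total == 2 * calls)
for code, pct in shares.items():
    print(f"{code}: {pct}%")
```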
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633526
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T32
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 254-896-342-130-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HKUST Mandarin Telephone Transcript Data, Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T32
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
HKUST Mandarin Telephone Transcript Data Part 1 was developed by Hong Kong University
of Science and Technology (HKUST). In 2004 HKUST was contracted to collect and transcribe
200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers
in mainland China under the DARPA EARS framework. The first 50 hours of speech and
transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation.
NIST partitioned the remaining 150 hours of collection into training, development
and evaluation sets. This release contains the training and development sets with
873 and 24 calls, respectively. Subject recruitment was done in several cities across
mainland China. Most subjects did not previously know each other. To encourage more
meaningful conversation, topics similar to those in Fisher English were designed.
All calls were operator-assisted, namely, an operator would call two participants
as scheduled to initiate a call. Subjects were asked demographic questions before
they were bridged for normal conversation. Their answers to the demographic questions
were recorded on separate files. Subjects were allowed to talk up to 10 minutes. With
a few exceptions, most calls are of the maximum length. Although subjects were allowed
to make up to three calls, all subjects made just one call in this release with one
exception, where PIN 10683 and PIN 10686 belong to a single individual. Each side
of a call was recorded in a separate .wav file, sampled at 8 kHz with 8 bits per sample
(a-law encoded). The two sides were later multiplexed in SPHERE format with a-law encoding preserved.
In the case where one side was shorter than the other, the shorter side was padded
with silence. In the release, the file name of each recorded call is in the format
of "date_time_Apin_Bpin.sph" and the corresponding transcript is in the same format
with .txt extension. *Speaker demographics* Subjects were asked to provide several
pieces of demographic information, including gender, age, native language/dialect,
birthplace, education, occupation, phone type, etc. Because Standard Mandarin is
not the native dialect in many regions of China but is the official language of education,
and speakers may or may not have regional accents when speaking Mandarin, subjects'
birthplaces were divided into Mandarin-dominant and non-Mandarin-dominant regions,
and all calls were audited and classified as either standard or accented, without
further distinctions. Selected demographics - age, gender, birthplace, phone
type and accent for each side of the call and the topic ID for the call - are provided
as a tab-delimited, plain-text, tabular file. *Transcription* All calls were fully
transcribed from the beginning to the end. Standard simplified Chinese characters,
encoded in GBK (CP-936), were used. Speech is segmented at natural boundaries wherever
possible and each segment is no more than 10 seconds long. HKUST formulated transcription
guidelines based on LDC's RT-03 transcription guidelines. For more information, refer
to "trans-guidelines.pdf" included in the release. The transcripts provided by HKUST
were XML-formatted with each side of a call in a separate file. LDC multiplexed the
two sides into a single file with turns interleaved in temporal order (based on the
initial time stamps), and converted the format into the LDC format. All transcripts
were checked against RT-04 formatting standards. The following is a list of RT-04
conventions that are different from those in the transcription guidelines. * Speaker
noise: curly brackets, e.g. {laugh}, instead of angle brackets; * Foreign language:
TEXT instead of TEXT. The Chinese text is not segmented into words, though there are
occasional white spaces within some turns.
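The file-naming scheme and the interleaving of the two call sides described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expression assumes underscore-separated numeric fields with literal "A" and "B" before each PIN, the turn tuples `(start, end, text)` are a hypothetical in-memory form rather than the LDC transcript format, and the file name in the usage lines is made up.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumed layout: <date>_<time>_A<pin>_B<pin>.sph
CALL_NAME = re.compile(
    r"^(?P<date>\d+)_(?P<time>\d+)_A(?P<pin_a>\d+)_B(?P<pin_b>\d+)\.sph$"
)


class Call(NamedTuple):
    date: str
    time: str
    pin_a: str
    pin_b: str
    transcript: str  # same base name with a .txt extension


def parse_call_name(name: str) -> Call:
    """Split a call file name into its fields and derive the transcript name."""
    m = CALL_NAME.match(name)
    if m is None:
        raise ValueError(f"unexpected call file name: {name!r}")
    return Call(m["date"], m["time"], m["pin_a"], m["pin_b"], name[:-4] + ".txt")


Turn = Tuple[float, float, str]  # hypothetical (start, end, text)


def interleave(side_a: List[Turn], side_b: List[Turn]) -> List[Tuple[float, float, str, str]]:
    """Merge two per-side turn lists into one list ordered by start time."""
    tagged = [(start, end, "A", text) for start, end, text in side_a]
    tagged += [(start, end, "B", text) for start, end, text in side_b]
    # sorted() is stable, so side A stays ahead of side B when starts tie.
    return sorted(tagged, key=lambda turn: turn[0])


# Made-up file name and turns, for illustration only.
call = parse_call_name("20040102_093000_A00001_B00002.sph")
merged = interleave([(0.0, 2.5, "a1"), (5.0, 6.0, "a2")], [(2.6, 4.9, "b1")])
print(call.transcript, [side for _, _, side, _ in merged])
```

When reading a transcript in Python, passing `encoding="gbk"` to `open()` matches the GBK (CP-936) encoding named in the summary.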
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fung, Pascale
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T32
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634565
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 281-421-598-631-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Apple Words and Phrases
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Apple Words and Phrases Version 1.3 contains approximately 69.5 hours of speech from
3008 telephone calls placed on analog and digital phone systems. Apple Computer, Inc.
supported the development of this data and also supplied the list of words and phrases
collected. Callers responded to questions and repeated a list of phrases as they were
prompted. *Data* Subjects calling the analog system (998 callers) were employees of
Apple Computer, Inc. and were solicited through interoffice email within the company.
Subjects calling the digital system (2010 callers) were responding to Usenet postings
or newspaper advertisements placed in several cities across the United States. Each
subject called the CSLU data collection system by dialing a toll-free number. The
analog data were collected via a Worldport Pod on an Apple Quadra A/V. The digital
data were collected with the CSLU T1 digital data collection system. Callers were
prompted to answer certain questions, including "What is your native language?", "In which
city and state did you spend most of your childhood?", "What time is it now?" and "What day
is today?" Callers were also instructed to repeat various command-and-control-type
phrases, including "play previous message again", "make a meeting for today", "quit",
"who is at work", "what is the area code for this state", "hello, what are my messages",
"help", "please send a car from the city", "delete my email tomorrow", "read this
text", "erase all information", "record extended phonebook", "transfer all calls to
home at twelve o'clock", "record urgent message" and "find the operator". Each recorded
utterance was listened to by a human verifier to determine if the speaker adequately
followed the directions. If an utterance contained extraneous words or excessive noise,
it was not included in the corpus.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Noel, Mike
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633550
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 514-959-558-272-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RT-04 MDE Training Data Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the MDE RT-04 Training Data Speech, Linguistic
Data Consortium (LDC) catalog number LDC2005S16 and ISBN 1-58563-355-0. This corpus
was created by Linguistic Data Consortium to provide training data for the RT-04 Fall
Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable,
Reusable Speech-to-Text) Program. This data set has been created and distributed by
Linguistic Data Consortium. This data was previously released to the EARS MDE community
as LDC2004E31. The goal of MDE is to enable technology that can take raw Speech-to-Text
output and refine it into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic transcripts that
are maximally readable. This readability might be achieved in a number of ways: flagging
non-content words like filled pauses and discourse markers for optional removal; marking
sections of disfluent speech; and creating boundaries between natural breakpoints
in the flow of speech so that each sentence or other meaningful unit of speech might
be presented on a separate line within the resulting transcript. Natural capitalization,
punctuation and standardized spelling, plus sensible conventions for representing
speaker turns and identity are further elements in the readable transcript. LDC has
defined a SimpleMDE annotation task specification and has annotated English telephone
and broadcast news data to provide training data for MDE.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Metadatabases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633585
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 314-507-149-954-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RT-04 MDE Training Data Text/Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This corpus was created by Linguistic Data Consortium to provide training data for
the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. This data set has been created and distributed
by Linguistic Data Consortium. This data was previously released to the EARS MDE community
as LDC2004E31. The goal of MDE is to enable technology that can take raw Speech-to-Text
output and refine it into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic transcripts that
are maximally readable. This readability might be achieved in a number of ways: flagging
non-content words like filled pauses and discourse markers for optional removal; marking
sections of disfluent speech; and creating boundaries between natural breakpoints
in the flow of speech so that each sentence or other meaningful unit of speech might
be presented on a separate line within the resulting transcript. Natural capitalization,
punctuation and standardized spelling, plus sensible conventions for representing
speaker turns and identity are further elements in the readable transcript. LDC has
defined a SimpleMDE annotation task specification and has annotated English telephone
and broadcast news data to provide training data for MDE. In this release, some original
annotations contained in LDC2004E31 have been re-mapped to new MDE elements to support
better annotation consistency. In particular, the mapping affects Discourse Responses
(DR), Discourse Markers (DM) and Backchannel SUs (BC). A description of the original
mapping proposed by ICSI appears in 3) below, with complete documentation of the mapping
rules contained in the docs/drmap-discussion directory. The scripts used to apply
the mapping can be found in the docs/scripts/drmap directory.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Metadatabases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ang, Jeremy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633720
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 269-933-843-612-1
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The HARD 2004 Text Corpus was produced by Linguistic Data Consortium (LDC), catalog
number LDC2005T28 and ISBN 1-58563-372-0. This corpus contains source data for the
2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was
a track within the NIST Text REtrieval Conference (TREC), with the objective of achieving
high accuracy retrieval from documents by leveraging additional information about
the searcher and/or the search context, through techniques like passage retrieval
and the use of targeted interaction with the searcher. The current corpus was previously
distributed to HARD Participants as LDC2004E30. The topics and annotations that correspond
to this release are distributed as LDC2005T29, HARD 2004 Topics and Annotations. This
corpus was created with support from the DARPA TIDES Program and LDC. *Data* The corpus
comprises eight English newswire and web text sources from January-December 2003.
The sources are:
AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English
Volume of data for each source appears in the table below:
Source    Stories    Total Tokens    Avg. Tokens/Story
----------------------------------------------------------
AFE:      226,515      71,829,978      317
APE:      237,067      93,294,584      393
CNE:        3,674         797,194      217
LAT:       18,287      12,576,721      687
NYT:       28,190      16,673,028      591
SLN:        3,321       4,710,500    1,418
UME:        2,607         782,064      299
XIE:      117,854      24,016,670      203
Total:    637,515     224,680,739
Files are organized by source on a daily basis. Each file contains multiple documents
identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is
a sequential number starting from "0001" for each source/day. In addition, each document
has some or all of the following components:
* Keyword (optional), surrounded by tags
* Date/time (optional), surrounded by tags
* Headline, surrounded by tags
* Main part, surrounded by tags. Tags are used within this part to identify paragraph
boundaries.
For more information please visit the HARD Project website.
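The document ID form "SRCyyyymmdd.nnnn" described above can be unpacked with a short Python sketch. The function name `parse_doc_id` and the example ID are hypothetical. The pattern assumes that "SRC" is always one of the three-letter uppercase source codes listed in the summary.

```python
import re
from datetime import date
from typing import Tuple

# Assumed shape: three uppercase letters, an 8-digit date, a dot, 4 digits.
DOC_ID = re.compile(r"^(?P<src>[A-Z]{3})(?P<ymd>\d{8})\.(?P<seq>\d{4})$")


def parse_doc_id(doc_id: str) -> Tuple[str, date, int]:
    """Split a document ID into (source code, calendar date, sequence number)."""
    m = DOC_ID.match(doc_id)
    if m is None:
        raise ValueError(f"unexpected document ID: {doc_id!r}")
    ymd = m["ymd"]
    # date() raises ValueError for impossible dates such as month 13.
    day = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    return m["src"], day, int(m["seq"])


# Made-up ID, for illustration only.
print(parse_doc_id("AFE20030115.0001"))
```

Returning a real `date` object rather than the raw digits makes it straightforward to group documents by day or to check that an ID falls within the January-December 2003 range given in the summary.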
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Metadatabases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634182
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 396-836-683-088-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT5 Topics and Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the TDT5 Topics and Annotations, Linguistic Data
Consortium (LDC) catalog number LDC2006T19 and ISBN 1-58563-418-2. This release includes
topic relevance judgments and associated information for the TDT5 2004 evaluation
topics. This release contains complete relevance judgments, including the results
of adjudication, in which discrepancies between system submissions and LDC annotations
are reviewed and relevance judgments updated. This release also contains answer keys
for the link detection task. The TDT5 corpora were created by Linguistic Data Consortium
with support from the DARPA TIDES (Translingual Information Detection, Extraction
and Summarization) Program. The multilingual news text corresponding to this publication
can be found in LDC Publication LDC2006T18, TDT5 Multilingual News Text. *Data* A
total of 250 topics, numbered 55001 - 55250, were annotated by LDC using a search-guided
annotation technique. Details of the annotation process are described in the
annotation task definition. Approximately 25% of the topics are monolingual English
(ENG), 25% are monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB),
and 25% are multilingual:
   63  ENG
   62  MAN
   62  ARB
   35  ARB ENG MAN
   21  ENG MAN
    7  ARB ENG
  250  total
Broken down by language and counting both mono- and multi-lingual topics:
  126  ENG
  118  MAN
  104  ARB
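The per-language totals follow directly from the topic breakdown. The sketch below recomputes them from the six counts given in the summary; the variable names are illustrative only.

```python
from collections import Counter

# Number of topics for each language combination, as listed in the summary.
topic_counts = {
    ("ENG",): 63,
    ("MAN",): 62,
    ("ARB",): 62,
    ("ARB", "ENG", "MAN"): 35,
    ("ENG", "MAN"): 21,
    ("ARB", "ENG"): 7,
}

total_topics = sum(topic_counts.values())

# A multilingual topic counts once toward every language it covers.
per_language = Counter()
for languages, count in topic_counts.items():
    for language in languages:
        per_language[language] += count
```

The per-language figures sum to more than 250 because each multilingual topic is counted under every language it covers.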
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633461
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 513-688-150-766-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Articulation Index
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Articulation Index was developed by the Linguistic Data Consortium (LDC) and was partly
inspired by the work of Harvey Fletcher, who performed a number of perceptual experiments
involving English syllables during the first half of the 20th century. His term articulation
index meant something like perceptual index of syllables, where those syllables were
not necessarily words, and reflected how well speakers could correctly identify syllables
in the presence of noise. This corpus was created to facilitate similar experiments,
as well as to potentially facilitate new methods in speech recognition research. The
basic concept behind the corpus was to record speakers pronouncing syllables of English,
some of which might be real words, but most of which are nonsense syllables. The goal
was to have each speaker say a set of 2,000 syllables common to all speakers, as well
as a set of 20 syllables unique to that speaker. LDC has also released Articulation
Index LSCP (LDC2015S12). *Data* This release contains recordings of 20 American English
speakers (12 males, 8 females) saying 2005 common syllables, 1845 of which are common
to all speakers, and 400 unique syllables (20 syllables/speaker). The recordings
were made in a small, sound-treated anechoic room at LDC. The speakers wore two microphones:
a Sennheiser 410 headset and a Nortel Liberator wireless phone headset. The Sennheiser's
signal traveled through a Symetrix 302 Dual Microphone Preamp, Sony PCM-R300 DAT deck
and Townshend Datlink to a Sun Sparcserver 20 where it was written to disk as 16 kHz,
16-bit PCM data. The Nortel's signal was transmitted to a wireless base station at
a telephone connected via the network to LDC's telephone recording platform where
it was captured to disk as 8 kHz, 8-bit, u-law data. The speakers were prompted via
a computer interface that displayed one prompt at a time, allowing them to iterate
through the prompts by pressing a "next" button. Each recording session lasted approximately
15 minutes.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633437
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 165-794-218-631-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Arabic Treebank: Part 4 v 1.0 (MPG Annotation),
Linguistic Data Consortium (LDC) catalog number LDC2005T30 and ISBN 1-58563-343-7.
The goal of the Arabic Treebank project is to support the development of data-driven
approaches to natural language processing (NLP), human language technologies, automatic
content extraction (topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on Modern
Standard Arabic in general. To that end, LDC was sponsored to develop an Arabic POS and Treebank
of 1,000,000 words. This corpus is the fourth part of that project. In this release,
we provide annotation on part of speech (POS), gloss, and word segmentation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mekki, Wigdan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u hrv d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633593
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 531-836-688-808-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
hrv
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hrv
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Croatian Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on West Point Croatian Speech, Linguistic Data Consortium
(LDC) catalog number LDC2005S28 and ISBN 1-58563-359-3. West Point Croatian Speech
is a database of digital recordings of spoken Croatian. It was collected by staff
and faculty of the Department of Foreign Languages (DFL) and Center for Technology
Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition
systems. The US government uses these systems to provide speech recognition enhanced
language learning courseware to government linguists and students enrolled in various
government language programs. In addition, parts of this corpus were designed to model
question-answer dialogues for use in domain-specific speech-to-speech translation
systems. The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb,
Croatia. Informants were recruited from the English department at the University of
Zagreb and the Croatian Military Academy. The 2000 subcorpus consists entirely of
read speech, while the 2001 subcorpus includes free-response answers to questions in
addition to read speech. The read speech in the two subcorpora was elicited from
two different prompt scripts. Each informant in 2000 attempted to read 100 sentences
from a total of 200 carefully designed sentences. These sentences were written by
Christine Tomei. Dr. Tomei's design analysis can be found in the file design-2000.txt.
Informants in 2001 read short text passages extracted from Croatian language webpages.
Thus the scripts used to record read speech contain a total of 6,329 distinct sentences.
The read speech prompts are listed in the files read-200[01].txt in the transcripts
directory. Each line of these files has two fields separated by a tab, the first denoting
the base name of the waveform file, and the second the prompt used in recording the
utterance. The read speech data are stored under the Recordings Croatian directory.
The script used to elicit free response answers contains 143 questions. The text that
was actually presented to the informants is in the file named questions.txt in the
transcripts directory. Data recorded from these prompts are stored in the Answers
Croatian directory. The human-performed transcriptions of the informants' answers
are listed in the answers.txt file in the transcripts directory. Again, each line
of this file has two fields separated by a tab. The first field contains two numbers
separated by a slash. The first number is an identification index for the speaker.
The second number is an index to the question. The second field on the line contains
a word-level transcription of the informant's answer to the question indexed by the
second number in the first field. So, for example, in the line "1/15 eh roena je u
splitu", the text "eh roena je u splitu" is a transcription of the response speaker one gave to
question 15. The corresponding waveform file is stored in the file 15.wav in the directory
Answers Croatian1. These recordings were transcribed by Milan Sokolich. Mr. Sokolich
also wrote a pronouncing dictionary that includes grammatical tags. His work is stored
in the file named raw-lexicon.txt. The file lexicon.txt contains a processed version
of the raw-lexicon.txt file. Each speaker in the 2001 subcorpus attempted to record
105 utterances by reading 75 sentences and giving 35 free response answers to 35 questions.
Speech data was collected using Pentium 450 MHz laptop computers running Windows 2000
with a 16-bit data size and sampling rate of 22,050 Hz. The recording script presented
a visual display of the sentence to be recorded. The informant pressed a key and spoke
the sentence. The recording was played back for review allowing the utterance to be
re-recorded. A member of the data collection team was on hand during the recording
session to verify recordings and provide technical assistance in case of malfunctioning
equipment.
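The two-field, tab-separated layout of the transcript files described above is straightforward to read. The following is a minimal sketch, assuming each line has exactly one tab between the two fields and that the speaker and question indexes are plain integers; the function names are illustrative and are not part of the corpus.

```python
def parse_prompt_line(line):
    """Split a read-speech prompt line into (waveform base name, prompt text)."""
    base_name, prompt = line.rstrip("\n").split("\t", 1)
    return base_name, prompt

def parse_answer_line(line):
    """Split an answers.txt line into (speaker index, question index, transcription)."""
    key, transcription = line.rstrip("\n").split("\t", 1)
    speaker, question = key.split("/")
    return int(speaker), int(question), transcription

# The example line quoted in the summary, with the tab made explicit.
speaker, question, transcription = parse_answer_line("1/15\teh roena je u splitu\n")
```

Splitting on the first tab only keeps any later tab characters inside the transcription field rather than failing on them.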
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Croatian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
LaRocca, Stephen A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tomei, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sokolich, Milan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633542
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 731-738-468-307-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Proposition Bank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Proposition Bank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T23
and ISBN 1-58563-354-2, was produced by LDC. Chinese Proposition Bank 1.0 is the first
public release of the Penn Chinese Proposition Bank project, which aims to create
a corpus of text annotated with information about basic semantic propositions. Specifically,
predicate-argument relations have been added to the syntactic trees of the first update
to Chinese Treebank 5.0 as an additional layer of annotation. *Data* Chinese Proposition
Bank 1.0 includes annotations for files chtb_001.fid to chtb_931.fid, or the first
250K words of the first update of Chinese Treebank 5.0. There is a total of 37,183
propositions. Auxiliary verbs are not annotated. Some verbs have light-verb and non-light-verb
uses, and in these cases only the non-light-verb uses are annotated. All the annotations
in this release are the result of double blind annotation followed by adjudication
of differences. The following table summarizes the framesets in CPB 1.0:
  Total verbs framed              4,865
  Total framesets                 5,298
  Verbs with multiple framesets     351
  Average framesets per verb       1.09
*Annotation Format* Each P-A structure is represented in a line of space-separated
columns. The columns are as follows:
  ctb-filename sentence terminal tagger frameset inflection arglabel arglabel ...
The content of each column is described in detail below.
ctb-filename: the name of the file in the Penn Chinese TreeBank 5.0 update 1.
sentence: the number of the sentence in the file (starting with 0).
terminal: the number of the terminal in the sentence that is the location of the verb.
Note that the terminal number counts empty constituents as terminals and starts with 0.
This will hold for all references to terminal number in this description. An example:
(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) (VP (PP-BNF (P 为)(IP
(NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) (NP (NN 区)))(NP (NN 物价))))))(VP
(VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))
The terminal numbers: 货币 0 回笼 1 的 2 增加 3 , 4 为 5 *PRO* 6 平抑 7 全 8 区 9 物价 10
发挥 11 了 12 作用 13 。 14
tagger: the name of the annotator, or "gold" if it's been double annotated and adjudicated.
frameset: the frameset identifier from the frames file of the verb. For example, '发挥.01'
refers to the frameset ID "f1" in the frame file for the verb '发挥' (frames/0930-fa-hui.xml).
The names of the frame files are composed of numerical id, plus the pinyin of the verb.
The numerical ids can be found in the enclosed verb list (verbs.txt).
inflection: the inflection field is a carry-over from the Penn English Proposition Bank,
and is set to '-----', meaning no annotation in the Chinese Proposition Bank.
arglabel: a string representing the annotation associated with a particular argument
or adjunct of the proposition.
Each arglabel is dash '-' delimited and has the following columns:
1) column for the address of a constituent. The address of the constituent is in one
of the following forms.
form 1: terminal number:height
A single node in the syntactic tree of the sentence in question, identified by
the first terminal the node spans together with the height from that terminal to the
syntax node (a height of 0 represents a terminal). For example, in the sentence
(IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP
(ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) (NP-ADV (NN 绝大部分))(NP-SBJ
(NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) (VP (VA 好)))(PU 。))
the address of "1:3" represents the top IP node and "2:2" represents the CP node.
form 2: terminal number:height*terminal number:height*...
A trace chain identifying coreference within sentence boundaries. For example, in the
sentence
(IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP
(ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) (NP-ADV (NN 绝大部分))(NP-SBJ
(NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) (VP (VA 好)))(PU 。))
the address of "2:0*1:0*6:1" represents the fact that nodes '2:0' (-NONE- *T*-1),
'1:0' (-NONE- *OP*) and '6:1' (NP (NN 外商)(NN 投资)(NN 企业)) are coreferential.
form 3: terminal number:height,terminal number:height,...
This represents a collection of different pieces of one argument. This form is rarely
used in the annotation of the verbs, since most discontinuous constituents have
well-defined relations between their components. Therefore the components of a
discontinuous constituent are assigned the same label with a secondary tag representing
their semantic relations. For example, if a constituent is marked as ARG0-CRD, it means
that there is another constituent having the same label and together they fill the ARG0
role of the verb.
2) column for the 'label'. The argument label is one of {rel, ARGM} + { ARG0, ARG1, ARG2, ... }.
The argument labels correspond to the argument labels in the frames files (see ./frames).
ARGM is for adjuncts of various sorts, and 'rel' refers to the surface string of the
verb.
3) column for 'functional tag' (optional for numbered arguments; required for ARGM).
Functional tags for "split" numbered arguments:
  PSR - possessor
  PSE - possessee
  CRD - coordinator
  PRD - predicate
  QTY - quantity
Prepositional tags for numbered arguments:
  AT, AS, INTO, TOWARDS, TO, ONTO
Functional tags for ARGM:
  ADV - adverbial, default tag
  BNF - beneficiary
  CND - conditional
  DIR - directional
  DIS - discourse
  DGR - degree
  EXT - extent
  FRQ - frequency
  LOC - location
  MNR - manner
  NEG - negation
  PRP - purpose and reason
  TMP - temporal
  TPC - topic
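The terminal numbering and arglabel layout described above can be illustrated with a short sketch. It assumes that every terminal appears in the bracketed tree as a "(TAG word)" pair and that an address never contains a dash. The function names are illustrative, and the sample arglabel "5:2-ARGM-BNF" is invented to show the syntax; it is not taken from the corpus.

```python
import re

# A preterminal is "(TAG word)" with no nested parentheses; this includes
# empty constituents such as (-NONE- *PRO*), which also count as terminals.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def number_terminals(tree):
    """Return the terminals of a bracketed tree in order; list index = terminal number."""
    return [word for _tag, word in PRETERMINAL.findall(tree)]

def parse_arglabel(arglabel):
    """Split an arglabel into (address pieces, label, functional tag or None)."""
    address, label, *tags = arglabel.split("-")
    # form 3 separates pieces with ',', form 2 chains nodes with '*',
    # and every node is written as terminal:height.
    pieces = [
        [tuple(int(n) for n in node.split(":")) for node in piece.split("*")]
        for piece in address.split(",")
    ]
    return pieces, label, (tags[0] if tags else None)

# The first example tree from the summary.
tree = (
    "(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) "
    "(VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) "
    "(NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))"
)
terminals = number_terminals(tree)
verb_position = terminals.index("发挥")
```

On the example tree this yields the same numbering as the summary, with the verb 发挥 at terminal 11. A form 1 address comes back as a single piece containing a single node, so all three address forms share one return shape.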
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633496
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 739-195-943-085-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Company G3 American English Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
During the 2000-2001 academic year, cadets, staff and faculty members at the United
States Military Academy volunteered to participate in a speech data collection project
for American English. The goal of the project was to amass recordings from no fewer
than 100 adult speakers (50 males and 50 females) to form a substantial corpus of
high-quality read speech. The project was conducted by the Center for Technology Enhanced
Language Learning, part of the U.S. Military Academy's Department of Foreign Languages.
Many of the 100-plus volunteers who provided the recordings were members of the staff
and faculty of the Department of Foreign Languages. Other volunteers were friends
and colleagues from other organizations who worked in offices in Washington Hall.
The largest group of volunteers was from Cadet Company G, Third Regiment, United States
Corps of Cadets. Cadet Company G3, encouraged by their tactical officer, Major Scott
Custer, adopted the speech data collection effort as a community service project.
Every female cadet in Company G3 recorded her voice, as did many of the male cadets,
including the cadet company commander and Major Custer. The 185 sentences comprising
the data collection script were written to elicit examples of all or almost all of the
possible syllables used in spoken American English. The G3 Corpus audio data comes
from 53 female and 56 male volunteers, each of whom recorded approximately 104 utterances.
The recordings are sampled at 22,050 samples per second with 16-bit resolution. Recordings
were made using headset microphones (Shure M10) with preamplifiers attached to the
line input jack of desktop computers. The total amount of speech is about 15 hours.
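As a rough cross-check of the figures in the summary above, the following sketch estimates the utterance count and the uncompressed audio size. It assumes single-channel (mono) linear PCM, which the record does not state; the function name pcm_bytes is hypothetical.

```python
def pcm_bytes(hours, sample_rate_hz=22_050, bits_per_sample=16, channels=1):
    """Estimate the size of uncompressed PCM audio in bytes.

    channels=1 is an assumption; the record does not give the channel count.
    """
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Figures taken from the summary: 53 female + 56 male speakers,
# roughly 104 utterances each, about 15 hours of speech in total.
speakers = 53 + 56
approx_utterances = speakers * 104
approx_size = pcm_bytes(15)

print(speakers, approx_utterances, approx_size)
```

Under these assumptions the corpus would hold on the order of eleven thousand utterances and a little under 2.4 GB of raw audio; the actual distribution size may differ depending on file format and headers.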
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
LaRocca, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bellinger, Sherri
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ruscelli, Charles (Chip)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633682
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T34
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 410-883-638-016-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese <-> English Name Entity Lists v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T34
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for Chinese English Name Entity Lists, v.1.0, Linguistic
Data Consortium (LDC) catalog number LDC2005T34 and ISBN 1-58563-368-2. These Chinese-English
bi-directional name entity lists are compiled from Xinhua News Agency newswire texts.
Not every irregularity in the original source has been detected and normalized. Some
Chinese characters are not encoded in the source and brackets are used to describe
their composition. Except for the person name lists, most instances were left untouched
in the created lists. An effort was made to replace GB-encoded characters (such as
Roman numbers) in the English translation with ASCII characters. However no attempt
has been made to do the opposite for Chinese names. The use of slashes as delimiters
presents another problem. Some names may have internal slashes. Initially, double
quotes ("") were used to enclose the name with an internal slash to avoid confusion
without realizing that there is just one " in ASCII (as opposed to a set of enclosing
" in GB). Later it was decided to use &slash;. In future releases, some lists will
be changed for greater consistency. Finally, most of the English names in the source
use lowercase throughout. An effort was made to capitalize the initial letter (and
possibly some middle ones) for person names, but not for any other kind of names as
most other names have multiple words, some of which may contain articles and prepositions.
The word "English" is somewhat misleading here. Although most of the foreign words
are English or can appear in English texts, there are also many non-English words
written in Roman alphabet - some of which may have English equivalents while others
do not. No efforts have been made to eliminate those non-English names where English
equivalents are available. The entire set consists of nine pairs of lists. The English->Chinese
version of each pair was created by reversing the Chinese->English version; both were sorted with
the Unix built-in sort function. The contents are as follows:
* Place Names, Chinese to English: 276,382
* Place Names, English to Chinese: 298,993
* Organization Names, Chinese to English: 30,800
* Organization Names, English to Chinese: 37,145
* Corporate Names, Chinese to English: 54,747
* Corporate Names, English to Chinese: 58,468
* Press Organization Names, Chinese to English: 29,757
* Press Organization Names, English to Chinese: 32,922
* Intl. Organization Names, Chinese to English: 7,040
* Intl. Organization Names, English to Chinese: 7,040
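The slash-delimiter convention described in the summary above can be sketched as follows. This is a minimal illustration, not the corpus's own tooling: the exact line layout of the list files is not given in this record, so the example assumes one entry per line with fields separated by "/" and internal slashes written as &slash;. The function names and the sample line are hypothetical.

```python
SLASH_ENTITY = "&slash;"


def split_entry(line):
    """Split one list line on '/' delimiters, then restore escaped slashes.

    Assumes (not confirmed by the record) that fields are separated by '/'
    and that a literal slash inside a name is written as '&slash;'.
    """
    fields = [field.strip() for field in line.strip().split("/")]
    return [field.replace(SLASH_ENTITY, "/") for field in fields if field]


def reverse_pairs(pairs):
    """Swap (source, target) pairs and sort them, loosely mirroring how the
    reversed lists are described. Python's sort order can differ from the
    Unix sort utility, whose order depends on the locale."""
    return sorted((target, source) for source, target in pairs)


# Hypothetical sample line, for illustration only.
print(split_entry("A&slash;B Corp / 甲乙公司 /"))
```

Splitting before unescaping matters: replacing &slash; first would reintroduce the ambiguity the escape was meant to avoid.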
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T34
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633569
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S26
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
swa
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ind
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
swa
- Language code of text/sound track or separate title:
swa
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hun
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
swh
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ind
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
swa
- Language code of text/sound track or separate title:
swc
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hun
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: 22 Languages Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the CSLU: 22 Languages v 1.2, Linguistic Data
Consortium (LDC) catalog number LDC2005S26 and ISBN 1-58563-361-5. Produced by the Center
for Spoken Language Understanding and distributed by the Linguistic Data Consortium,
the 22 Languages corpus consists of telephone speech from 21 languages: Eastern Arabic,
Cantonese, Czech, Farsi, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin,
Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese,
and English. The corpus contains fixed vocabulary utterances (e.g. days of the week)
as well as fluent continuous speech. Each of the 50,191 utterances is verified by
a native speaker to determine if the caller followed instructions when answering the
prompts. For this release, approximately 19,758 utterances have corresponding orthographic
transcriptions in all the above languages except Eastern Arabic, Farsi, Korean, Russian,
and Italian.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Vietnamese, Tamil, Swahili (individual language), Swedish,
Russian, Portuguese, Polish, Korean, Japanese, Indonesian, Hindi, English, German,
Arabic, Swahili, Congo Swahili, Spanish, Mandarin Chinese, Italian, Hungarian, Persian,
Dari, Iranian Persian, and Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633712
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 299-814-033-635-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Gigaword Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2006T02 and ISBN 1-58563-371-2. This is a comprehensive archive of newswire
text data that has been acquired from Arabic news sources by the Linguistic Data Consortium
(LDC), at the University of Pennsylvania. Arabic Gigaword Second Edition includes
all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as
new data. Five distinct sources of Arabic newswire are represented here:
* Agence France Presse (afp_arb; formerly afa)
* Al Hayat News Agency (hyt_arb; formerly alh)
* An Nahar News Agency (nhr_arb; formerly ann)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb; formerly xia)
The seven-letter codes in the parentheses above consist of the three-character
source name IDs and the three-character language code ("arb") separated by an underscore
("_") character. The three-letter language code denotes Standard Arabic in
the ISO 639-3 standard. In the first edition of the Arabic Gigaword corpus, a simpler
three-character-code scheme was used to identify both the source and the language.
The new convention allows us to distinguish data sets by source and language more
naturally when a single newswire provider distributes data in multiple languages.
Ummah Press is a new source added to the Second Edition. The following table shows
the new data that appear for the first time in the Second Edition.
Agence France Presse    2003.01-2004.12    143,766 documents
Al Hayat News Agency    2002.01-2003.12    64,308 documents
An Nahar News Agency    2003.01-2004.01    16,316 documents
Ummah Press             2003.01-2004.12    4,641 documents
Xinhua News Agency      2003.06-2004.12    106,236 documents
*Data* There are 423 files, totaling approximately 1.4GB in compressed form (5,359 MB uncompressed,
and 481,906 K-words in 1,591,987 documents). The table below lists, for each source: the number
of files; Gzip-MB, the total compressed file size; Totl-MB, the total uncompressed file size
(i.e. approximately 5.3 gigabytes total); K-wrds, the number of space-separated tokens in the
text, in thousands, excluding SGML tags; and #DOCs, the number of documents.
Source   #Files  Gzip-MB  Totl-MB  K-wrds  #DOCs
AFP_ARB  128     355      1429     123594  660621
HYT_ARB  119     524      1861     169100  369555
NHR_ARB  109     457      1649     151078  344084
UMH_ARB  24      4        13       1201    4645
XIN_ARB  43      103      407      36933   213082
TOTAL    423     1443     5359     481906  1591987
All text files in this corpus have been converted to UTF-8 character encoding. Owing
to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character
(ASCII) text, whereas lines of actual text data, including article headlines and datelines,
contain a mixture of single-byte and multi-byte characters. In general, single-byte
characters in the text data will consist of digits and punctuation marks (where the
original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation),
whereas multi-byte characters consist of Arabic letters and a small number of special
punctuation or other symbols. This variable-width character encoding is intrinsic
to UTF-8, and all UTF-8 capable processes will handle the data appropriately. Each
data file name consists of the seven-letter prefix, an underscore character ("_"),
and a six-digit date (representing the year and month during which the file contents
were generated by the respective news source), followed by a ".gz" file extension,
indicating that the file contents have been compressed using the GNU "gzip" compression
utility (RFC 1952). Therefore, each file contains all the usable data received by
LDC for the given month from the given news source. All text data are presented in
SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd
in the "dtd" directory provides the formal "Document Type Declaration" for parsing
the SGML content. The corpus has been fully validated by a standard SGML parser utility
(nsgmls), using this DTD file. Unlike older corpora, the present corpus uses only
the information structure that is common to all sources and serves a clear function:
headline, dateline, and core news content (usually containing paragraphs). All sources
have received a uniform treatment in terms of quality control, and have been categorized
into three distinct "types":
* story: this type of DOC represents a coherent report on a particular topic or event,
consisting of paragraphs and full sentences
* multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly
describes a particular topic or event: "summaries of today's news," "news briefs in ...
(some general area like finance or sports)" and so on
* other: these DOCs clearly do not fall into any of the above types; these are things like
lists of sports scores, stock prices, temperatures around the world, and so on
The general strategy for categorizing DOCs
into these three classes was, for each source, to discover the most common and frequent
clues in the text stream that correlated with the "non-story" types. When none of
the known clues was in evidence, the DOC was classified as a "story." Other "Gigaword"
corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"),
which applied to DOCs that contain text intended solely for news service editors,
not the news-reading public. In preparing the Arabic data, the task of determining
patterns for assigning "non-story" type labels was carried out by a native speaker
of Arabic, and (for whatever reason) this person did not find the "advis" category
to be applicable to any of the data. As described in the introduction section, a new
naming scheme for file names and document IDs is used in the Second Edition. All of
the documents in the first edition of the Arabic Gigaword corpus can be mapped to
the same documents in this edition by changing the prefix of DOC IDs and file names
as below. The upper case letters are used for the DOC IDs; the lower case letters
are used for the file and directory names. The underscore character that connects the
seven-letter prefix and the date is included in the following table.
Old  New
AFA  AFP_ARB_
ALH  HYT_ARB_
ANN  NHR_ARB_
XIA  XIN_ARB_
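The file naming scheme and the old-to-new prefix mapping described in the summary above could be handled along these lines. This is a sketch under stated assumptions: the six-digit date is taken to be YYYYMM, the An Nahar prefix is assumed to carry a trailing underscore like the others, and the function names are hypothetical. The remainder of a first-edition DOC ID is not described in this record, so only the prefix is mapped.

```python
import re

# Seven-character prefix (source + "_" + language), "_", six-digit date, ".gz".
# The YYYYMM reading of the six digits is an assumption.
FILE_NAME = re.compile(
    r"^(?P<source>[a-z]{3})_(?P<lang>[a-z]{3})_(?P<year>\d{4})(?P<month>\d{2})\.gz$"
)

# First-edition prefixes and their Second Edition replacements.
OLD_TO_NEW = {
    "AFA": "AFP_ARB_",
    "ALH": "HYT_ARB_",
    "ANN": "NHR_ARB_",  # trailing underscore assumed, matching the other rows
    "XIA": "XIN_ARB_",
}


def parse_file_name(name):
    """Return (source, language, year, month) for a data file name."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match["source"], match["lang"], int(match["year"]), int(match["month"])


def new_prefix(old_prefix):
    """Map a three-letter first-edition prefix to the new prefix.

    Upper case (DOC IDs) stays upper case; lower case (file and directory
    names) stays lower case.
    """
    new = OLD_TO_NEW[old_prefix.upper()]
    return new if old_prefix.isupper() else new.lower()


print(parse_file_name("afp_arb_200301.gz"))
print(new_prefix("AFA"), new_prefix("ann"))
```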
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 683-827-849-463-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Spanish Gigaword First Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Spanish Gigaword First Edition, Linguistic
Data Consortium (LDC) catalog number LDC2006T12 and ISBN 1-58563-393-3. The Spanish
Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired
over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania.
This is the first edition of the Spanish Gigaword Corpus, though some of the data
included here has been released previously in other LDC corpora. The three distinct
international sources of Spanish newswire in this edition, and the time spans of collection
covered for each, are as follows:
* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005
The seven-letter
codes in the parentheses above include the three-character source name abbreviations
and the three-character language code ("spa") separated by an underscore ("_") character.
The three-letter language code conforms to LDC's new internal convention based on
the new ISO 639-3 standard. The seven-letter codes are used in both the directory
names where the data files are found, and in the prefix that appears at the beginning
of every data file name. They are also used (in all UPPER CASE) as the initial portion
of the DOC "id" strings that uniquely identify each news story. *Data* The overall
totals for each source are summarized below. Note that the "Totl-MB" numbers show
the amount of data you get when the files are uncompressed (i.e. approximately 5 gigabytes,
total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the
DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens
(of all types) after all SGML tags are eliminated.
Source   #Files  Gzip-MB  Totl-MB  K-wrds  #DOCs
AFP_SPA  139     926      2731     393354  1382679
APW_SPA  144     600      1806     263225  886998
XIN_SPA  52      212      648      94459   388561
TOTAL    335     1738     5185     751038  2658238
The following tables present "Text-MB", "K-wrds" and "#DOCS" broken down by source and DOC type;
"Text-MB" represents the total number of characters (including whitespace) after SGML tags
are eliminated.
         Text-MB  K-wrds  #DOCs
type="advis":
AFP_SPA  40       15505   40580
APW_SPA  11       6173    11112
XIN_SPA  0        0       0
TOTAL    51       21678   51692
type="multi":
AFP_SPA  12       10282   12514
APW_SPA  30       12519   30892
XIN_SPA  32       17773   32463
TOTAL    74       40574   75869
type="other":
AFP_SPA  126      28305   126530
APW_SPA  153      39038   153932
XIN_SPA  26       3325    26828
TOTAL    305      70668   307290
type="story":
AFP_SPA  2166     339271  1202785
APW_SPA  1287     205501  691062
XIN_SPA  463      73360   329270
TOTAL    3916     618132  2223117
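The "K-wrds" measure described in the summary above (whitespace-separated tokens once SGML tags are removed) can be approximated as follows. This is a rough sketch, not the procedure LDC used: it strips tags with a regular expression rather than a real SGML parser, and the function names and the sample document, including its DOC id, are hypothetical.

```python
import re

# Naive tag pattern; it would be confused by a ">" inside an attribute value.
TAG = re.compile(r"<[^>]+>")


def strip_sgml_tags(text):
    """Replace every SGML tag with a space so adjacent words stay separate."""
    return TAG.sub(" ", text)


def count_tokens(text):
    """Count whitespace-separated tokens of all types after tag removal."""
    return len(strip_sgml_tags(text).split())


def k_words(token_count):
    """Express a token count in thousands, as in the K-wrds column."""
    return token_count / 1000


# Hypothetical document, for illustration only.
sample = (
    '<DOC id="AFP_SPA_19940512.0001" type="story">'
    "<HEADLINE>Titular de ejemplo</HEADLINE>"
    "<TEXT><P>Hola mundo .</P></TEXT></DOC>"
)
print(count_tokens(sample))
```

Punctuation separated by whitespace counts as a token here, which matches the "of all types" wording in the summary.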
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633879
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 021-421-953-520-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English-Arabic Treebank v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic
Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9. This release
of the English-Arabic Treebank consists of 52,238 words in 224 files of individual
Agence France Presse (AFP) news stories (corresponding to approximately the first
50K words of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02, ISBN:
1-58563-330-5). The English translation was provided by LDC, and was part-of-speech
tagged and treebanked for this project. *Data* The guidelines followed for both part-of-speech
and treebank annotation are essentially Penn Treebank II style, with two notable differences:
* POS: tokenization of hyphenated items ("New York-based" has been replaced by "New
York - based", for example), and the addition of HYPH and AFX tags necessitated by
this change in tokenization
* TreeBank: the addition of the node label NML for sub-NP
nominal constituents (replacing NX and most NP-internal NAC)
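The hyphen tokenization mentioned in the summary above can be illustrated with a small sketch. It only shows the basic idea of splitting a hyphen into its own token; the actual annotation guidelines may define exceptions that this record does not describe, and no HYPH or AFX tags are assigned here. The function names are hypothetical.

```python
import re


def split_hyphens(token):
    """Split a token at each hyphen, keeping the hyphen as its own token."""
    return [part for part in re.split(r"(-)", token) if part]


def tokenize(text):
    """Whitespace-tokenize, then separate hyphens inside each token."""
    tokens = []
    for token in text.split():
        tokens.extend(split_hyphens(token))
    return tokens


print(tokenize("New York-based"))  # ['New', 'York', '-', 'based']
```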
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633690
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005T35
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 797-978-576-065-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
American National Corpus (ANC) Second Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005T35
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the ANC Second Release, Linguistic Data Consortium
(LDC) catalog number LDC2005T35 and ISBN 1-58563-369-0. The American National Corpus
(ANC) project fosters the development of a corpus comparable to the British National
Corpus (BNC), covering American English. Corpus-analytic work has demonstrated that
the BNC is inappropriate for the study of American English, due to the numerous differences
in use of the language. The availability of a corpus of American English will significantly
contribute to language and linguistic research, the development of language understanding
computer applications (e.g., language translation and search and retrieval software),
and the compilation of reference works such as dictionaries and thesauri. It will
also provide a rich national resource for use in education at all levels. ANC Second
Release contains over 20 million words: 10+ million words added in the Second Release,
and a new corrected and validated version of the 11 million word ANC First Release.
The Second Release also contains software for searching and retrieving multiple stand-off
annotations. ANC Second Release contains texts from the following sources (* denotes
new source in the Second Release): * Transcribed telephone speech (LDC and Project
MORE) * The New York Times * Berlitz Travel Guides (Langenscheidt Publishers) * Slate
Magazine (Microsoft) * ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural
Communication)* * The Michigan Corpus of Academic Spoken English (MICASE) (University
of Michigan, English Language Institute)* * Various non-fiction * Various fiction
(Orin Hargraves, Ferd Eggan)* * Various medical research articles (BioMed Central,
Public Library of Science)* * Anonymized posts to the Phoenix Board/Buffistas.org*
ANC Second Release contains data governed under two types of licenses, an open license
and a restricted license. Both the Open License Agreement and the Restricted License
Agreement need to be signed in order to receive ANC Second Release, and the data must
be used in accordance with the agreement by which it is governed. The ANC will ultimately
contain a core corpus of at least 100 million words, including both written and spoken
(transcripts) data comparable across genres to the BNC. The genres in the ANC will
be expanded to include new types of language data that have become available in recent
years, such as web blogs and web pages, chats, email, and rap music lyrics. In addition
to the core 100 million words, the ANC will include an additional component of potentially
several hundreds of millions of words, chosen to provide both the broadest and largest
selection of data possible. The American National Corpus is being developed with the
help of a consortium of publishers of American English dictionaries and companies with
interests in language processing, formed in 1999. Consortium members are providing
materials for inclusion in the corpus, and provided initial financial support for
the project. Additional documentation and information is available at the ANC web
site at http://www.americannationalcorpus.org/SecondRelease/index.html.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reppen, Randi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ide, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Suderman, Keith
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005T35
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633747
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 815-941-649-807-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Propbank Annotations is a semantic annotation of the Korean English Treebank
Annotations and Korean Treebank Version 2.0. Each verb and adjective occurring in
the Treebank has been treated as a semantic predicate and the surrounding text has
been annotated for arguments and adjuncts of the predicate. The verbs and adjectives
have also been tagged with coarse grained senses. This work was done in the Computer
and Information Sciences Department at the University of Pennsylvania. The XML format
and KSC 5601 character set encoding are used in the frames file. *Data* There are
two basic components to Korean Propbank: * The Verb Lexicon. A frames file, consisting
of one or more frame sets, has been created for each predicate occurring in the Treebank.
These files serve as a reference for the annotators and for users of the data. 2,749
such files have been created, totaling about 10 MB of uncompressed data. * The Annotation.
There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate
tokens. These predicate tokens include all those occurring in over 54,000 words of
the Korean English Treebank Annotations, totaling ~791 KB of uncompressed data. The
newswire-verbs.pb file has 23,707 annotated predicate tokens. These predicate tokens
include all those occurring in over 131,000 words of the Korean Treebank Version 2.0,
totaling ~2,054 KB of uncompressed data.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Discourse analysis
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Semantics
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ryu, Shijong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Choi, Jinyoung
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yoon, Sinwon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jeon, Yeongmi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633755
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 018-899-448-641-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multiple-Translation Chinese (MTC) Part 4
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multiple-Translation Chinese (MTC) Part 4, Linguistic Data Consortium (LDC) catalog
number LDC2006T04 and ISBN 1-58563-375-5, was developed by LDC. To support the development
of automatic means for evaluating translation quality, LDC was sponsored to solicit
four sets of human translations for a single set of Chinese source materials. LDC
was also asked to produce translations from various commercial off-the-shelf (COTS)
systems, including commercial Machine Translation (MT) systems as well as MT systems
available on the Internet. There are a total of five sets of COTS outputs and six
output sets from TIDES 2003 MT Evaluation participants. To determine if automatic
evaluation systems, such as BLEU, track human assessment, LDC also performed human
assessments on one COTS output and the six TIDES research systems. The corpus includes
the assessment results for one of the five COTS systems, the assessment results for
the six TIDES research systems, and the specifications used for conducting the assessments.
*Data* Source Data Selection Two sources of journalistic Chinese text were selected
to provide the Chinese material: - Xinhua News Agency (Xinhua): 50 news stories -
Agence France Presse (AFP): 50 news stories (total: 100 stories) There are 100 source
files and 1,100 translation files. All source data were drawn from LDC's January and
February 2003 collection of Xinhua Chinese data and AFP Chinese data. The story selection
from the two newswire collections was controlled by story length: all selected stories
contain between 280 and 605 Chinese characters. The overall count of Chinese words
(excluding markup), by source, is shown in the following table:
AFP      22,450
Xinhua   19,650
Total    42,100
For the Chinese data, there are approximately 21K words, while for the English
translations, there are 396K words in total and 16K unique words.
Source Data Preparation for Human Translation The original source files used GB-2312
encoding for the Chinese characters, and SGML tags for marking sentence and paragraph
boundaries and other information about each story. The character encoding is unaltered.
To facilitate translation, nearly all SGML tags were removed or replaced by "plain
text" markers. Specifically, each story was presented to the human translators in
the following format: --Segment 1-- {Chinese text to be translated} --Segment 2--
{Chinese text to be translated} --Segment 3-- {Chinese text to be translated} ...
Each --Segment-- corresponds to a Chinese sentence. The rationale for using the term
"segment" instead of "sentence" was to discourage the translators from inserting additional
"-Sentence-" markers if a Chinese sentence was translated into two or more English
sentences. The markers were intended to assure that the resulting translations would
be easily alignable to the source texts, so extra care was taken to ensure that they
would be kept intact and properly oriented. Some normalization was performed on all
files to conform to the above format, including splitting long segments into smaller
chunks and adding segment markers. As a last step, all files were converted from UNIX-style
line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed)
on the assumption that most (possibly all) translators would use MS-Windows-based
editors. Human Translation Procedure and Quality Assessment Each initially selected
translation team received the translation guidelines and a sample pair of source and
translation (excluded from the final release) for review. After the team indicated
that they understood the task requirements and would be willing to participate in
the project, 100 news stories were sent to them. Each translation team returned the
first five AFP stories for quality checking to ensure that the team was following
the guidelines and that the translation quality was acceptable. LDC returned translations
to the translation team for any deviations from the guidelines or for quality issues
detected. Subsequent translation submissions were continuously monitored for conformance
and quality. Once the full set of translations was complete, a final pass of reformatting
and validation was carried out to assure alignability of segments and to convert the
translated texts into SGML format. Each translation team was also asked to complete
and return a questionnaire to describe their procedures and professional background.
Machine Translation Procedure Complete sets of automatic MT translations were also
produced by submitting the 100 stories to each of the five publicly-available MT systems.
Starting from the original SGML text format, special alterations were made to the
files on an as-needed basis, so that they would be accepted and handled correctly
by the various systems. Also, the systems differed in terms of the input and retrieval
methods required to submit the source data for translation and to save the translated
text in alignable form. Human Assessment Procedure The goal of this effort was to
evaluate the quality of TIDES research, human translation teams and commercial
off-the-shelf (COTS) systems. Translations were evaluated on the basis of adequacy and fluency.
Adequacy refers to the degree to which the translation communicates information present
in the original source language text. Fluency refers to the degree to which the translation
is well-formed according to the grammar of the target language. Final Data Format
and Validation For the present release, the corpus content is organized into source
and translation directories. Within translation there is a separate subdirectory for
each translation service or system, identified as follows:
Human translators: E01 E02 E03 E04
COTS systems: E05 E06 E07 E08 E09
Research systems: E11 E12 E14 E15 E17 E22
The source directory and each of the human and COTS translation subdirectories
contain 100 files with one news story per file. Corresponding file names are identical
across all directories, consisting of "docid.sgm." Within each source file, the content
is formatted in SGML as follows: [Chinese text in GB-2312 character encoding] [Chinese
text in GB-2312 character encoding] ... Ranking of Manual Translations Ranking of
manual translations was performed by two LDC staff members, one a Chinese-dominant
bilingual and the other an English native monolingual. There was overall agreement
on the ranking between the two and minor discrepancies were resolved through discussion
and comparison of additional files. The ranking for the manual translations, from best
to worst, is: E01 > E02 > E03 > E04. The ranking method was unstructured and somewhat casual;
it is not intended to be definitive, or even accountable.
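The segment-marker layout and the line-termination conversion described in this summary can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts LDC used: the function names `to_dos_line_endings` and `parse_segments`, the exact marker spelling `--Segment N--` on a line of its own, and the sample story text are all hypothetical.

```python
import re

# Assumes each marker occupies its own line, e.g. "--Segment 1--".
SEGMENT_MARKER = re.compile(r"^--Segment (\d+)--[ \t]*$", re.MULTILINE)

def to_dos_line_endings(text):
    """Convert new-line-only termination to carriage-return plus line-feed."""
    # Normalize existing CRLF pairs first so repeated calls do not add extra CRs.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

def parse_segments(text):
    """Return {segment number: text} from a '--Segment N--' delimited story."""
    text = text.replace("\r\n", "\n")
    pieces = SEGMENT_MARKER.split(text)
    # pieces looks like [preamble, "1", body1, "2", body2, ...].
    return {int(n): body.strip() for n, body in zip(pieces[1::2], pieces[2::2])}

story = "--Segment 1--\nfirst sentence\n--Segment 2--\nsecond sentence\n"
dos = to_dos_line_endings(story)
segments = parse_segments(dos)
print(segments)
```

Keying the translated text by segment number is what would make a translation alignable with its source, provided the markers come back intact.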
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633631
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 960-768-408-027-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Voices Corpus was created by Alexander Kain for his Ph.D. dissertation work on
high resolution voice transformation. The corpus contains 12 speakers reading 50 phonetically
rich sentences. The recording procedure involved a "mimicking" approach which resulted
in a high degree of natural time-alignment between different speakers. The acoustic
wave and the concurrent laryngograph signal were recorded for one "free" and two "mimicked"
renditions of each sentence. Pitch marks, calculated from the laryngograph signal,
and time marks, the output of a forced-alignment algorithm, have been added to the
corpus.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech synthesis
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kain, Alexander
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633763
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 458-031-085-383-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2005 Multilingual Training Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This publication contains the complete set of English, Arabic and Chinese training
data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus
consists of data of various types annotated for entities, relations and events, and was
created by Linguistic Data Consortium with support from the ACE Program, with additional
assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18)
to participants in the 2005 ACE evaluation. The objective of the ACE program is to
develop automatic content extraction technology to support automatic processing of
human language in text form. In November 2005, sites were evaluated on system performance
in five primary areas: the recognition of entities, values, temporal expressions,
relations, and events. Entity, relation and event mention detection were also offered
as diagnostic tasks. All tasks with the exception of event tasks were performed for
three languages, English, Chinese and Arabic. Events tasks were evaluated in English
and Chinese only. The current publication comprises the official training data for
these evaluation tasks. A complete description of the ACE 2005 Evaluation can be found
on the ACE Program website maintained by the National Institute of Standards and Technology
(NIST). For more information about linguistic resources for the ACE Program, including
annotation guidelines, task definitions, free annotation tools and other documentation,
please visit LDC's ACE website. Below is information about the amount of data included
in the current release and its annotation status.
* 1P: data subject to first pass (complete) annotation
* DUAL: data also subject to dual first pass (complete) annotation
* ADJ: data also subject to discrepancy resolution/adjudication
* NORM: data also subject to TIMEX2 normalization

English
         words                               files
         1P       DUAL     ADJ      NORM     1P     DUAL   ADJ    NORM
NW       60658    57807    33459    48399    128    124    81     106
BN       59239    58144    52444    55967    239    234    217    226
BC       46612    46110    33874    40415    68     67     52     60
WL       45210    43648    35529    37897    127    122    114    119
UN       45161    44473    26371    37366    58     57     37     49
CTS      47003    47003    34868    39845    46     46     34     39
Total    303833   297185   216545   259889   666    650    535    599

Chinese
Note: Chinese data expressed in terms of characters. We assume a correspondence of
roughly 1.5 characters/word.
         chars                      files
         1P       DUAL     ADJ      1P     DUAL   ADJ
NW       127319   124175   121797   248    242    238
BN       134963   133696   120513   332    328    298
WL       71839    68063    65681    107    101    97
Total    334121   325834   307991   687    671    633

Arabic
         words                      files
         1P       DUAL     ADJ      1P     DUAL   ADJ
NW       61287    56158    53026    239    226    221
BN       29259    27165    26907    134    128    127
WL       21687    20181    20181    60     55     55
Total    112233   103504   100114   433    409    403
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Standard Arabic, and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication)
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Medero, Julie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633860
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 717-712-373-266-4
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TimeBank 1.2 contains 183 news articles that have been annotated with temporal information,
adding events, times and temporal links between events and times. The annotation follows
the TimeML 1.2.1 specification available at www.timeml.org. *Data* TimeML aims to capture
and represent temporal information. This is accomplished using four primary tag types:
TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals,
and LINK for representing relationships. For a detailed description of TimeML, see
the TimeML 1.2.1 Specification and Guidelines. Here, we give a summary of each tag.
TIMEX3. This tag is used to capture dates, times, durations, and sets of dates and
times. All TIMEX3 tags include a type and a value along with some other possible attributes.
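A TIMEX3 tag with its type and value might look like the hypothetical fragment in the Python sketch below. The sentence, the `tid` identifiers and the attribute values are invented for illustration and are not drawn from the corpus; the wrapping `<s>` element is likewise an assumption made only so the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; not taken from TimeBank.
fragment = (
    '<s>The meeting on <TIMEX3 tid="t1" type="DATE" value="2006-01-03">'
    'January 3, 2006</TIMEX3> lasted <TIMEX3 tid="t2" type="DURATION" '
    'value="PT2H">two hours</TIMEX3>.</s>'
)

root = ET.fromstring(fragment)
# Collect (id, type, value, surface text) for every temporal expression.
timexes = [
    (el.get("tid"), el.get("type"), el.get("value"), el.text)
    for el in root.iter("TIMEX3")
]
for row in timexes:
    print(row)
```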
The value is given according to the ISO 8601 standard. The TIMEX3 tag allows specification
of a temporal anchor. This facilitates the use of temporal functions to calculate the
value of an underspecified temporal expression. For example, an article might include
a document creation time such as "January 3, 2006." Later in the article, the temporal
expression "today" may occur. By anchoring the TIMEX3 for "today" to the document
creation time, we can determine the exact value of the TIMEX3. EVENT. The EVENT tag
is used to annotate those elements in a text that mark the semantic events described
by it. Any event that can be temporally anchored or ordered is captured with this
tag. An EVENT includes a class attribute with values such as occurrence, state, or
reporting. The class of an EVENT may indicate what relationships the event participates
in. In addition to the EVENT tag, events are also annotated with one or more MAKEINSTANCE
tags that include information about a particular instance of the event. This includes
part of speech, tense, aspect, modality, and polarity. When an event participates
in a relationship, it is actually the event instance that is referenced. This is to
allow for statements such as "John taught on Monday but not on Tuesday." Here, there
are actually two instances of the teaching-event: one that has a positive polarity
and one that is negative. Further, each instance participates in its own temporal
relationship with respect to "Monday" and "Tuesday." SIGNAL. The SIGNAL tag is used
to annotate temporal function words such as "after," "during," and "when." These signals
are then used in the representation of a temporal relationship. The following three
tags are link tags. They capture temporal, subordination, and aspectual relationships
found in the text. These tags do not consume any actual text, but they do relate the
three tag types above to each other. TLINK. Temporal links are represented with a
TLINK tag. A TLINK can temporally relate two temporal expressions, two event instances,
or a temporal expression and an event instance. Along with an identification marker
for each of these two elements, a relation type is given such as before, includes,
or ended by. When a signal is present that helps to define the relationship, an ID
for the SIGNAL is given as well. SLINK. This tag is used to capture subordination
relationships that involve event modality, evidentiality, and factuality. An SLINK
includes an event instance ID for the subordinating event and an event instance ID
for the subordinated event. Possible relation types for SLINK include modal, evidential,
and factive. An SLINK will typically not include a signal ID unless it has the relation
type conditional. Three specific EVENT classes interact with SLINK: reporting, i_state,
and i_action. ALINK. An aspectual connection between two event instances is represented
with ALINK. As with SLINK, this tag includes two event instance IDs, one that introduces
the ALINK and one that is the event argument to that event. The introducing event
has the class aspectual. Some possible relation types for ALINK are initiates, terminates,
and continues. TimeBank 1.2 contains 183 articles with just over 61,000 non-punctuation
tokens. The count for each TimeML tag is listed below: * EVENT 7,935 * MAKEINSTANCE 7,940
* TIMEX3 1,414 * SIGNAL 688 * ALINK 265 * SLINK 2,932 * TLINK 6,418 * Total 27,592
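The anchoring described in this summary can be sketched in a few lines of Python. This is only an illustration: the class name Timex3, its field names, and the resolve function are hypothetical and simplified, and they are not the attribute set defined by the TimeML specification. Only the "today" case from the summary is handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass
class Timex3:
    """Simplified, hypothetical stand-in for a TIMEX3 annotation."""
    tid: str
    timex_type: str                   # e.g. "DATE"
    text: str                         # the annotated expression
    value: Optional[str] = None       # ISO 8601 value, if already specified
    anchor_tid: Optional[str] = None  # id of the TIMEX3 this one is anchored to


def resolve(timex: Timex3, index: Dict[str, Timex3]) -> str:
    """Return an ISO 8601 value, following the temporal anchor if needed."""
    if timex.value is not None:
        return timex.value
    if timex.anchor_tid is None:
        raise ValueError(f"{timex.tid} has neither a value nor an anchor")
    anchor_value = resolve(index[timex.anchor_tid], index)
    if timex.text.lower() == "today":
        # "today" denotes the same calendar day as its anchor.
        return date.fromisoformat(anchor_value).isoformat()
    raise NotImplementedError(f"no temporal function for {timex.text!r}")


# Document creation time and a later underspecified expression.
dct = Timex3("t0", "DATE", "January 3, 2006", value="2006-01-03")
today = Timex3("t1", "DATE", "today", anchor_tid="t0")
index = {t.tid: t for t in (dct, today)}
print(resolve(today, index))
```

Other underspecified expressions (for example "yesterday") would need their own temporal functions; they are left out of this sketch.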
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Space and time in language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine learning.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pustejovsky, James
ADDED ENTRY--PERSONAL NAME
- Personal name:
Verhagen, Marc
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Littman, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gaizauskas, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Katz, Graham
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mani, Inderjeet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Knippen, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Setzer, Andrea
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633771
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S29
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 703-206-088-436-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Levantine Arabic QT Training Data Set 5, Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S29
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Levantine Arabic QT Training Data Set 5, Speech contains 1,660 calls totalling approximately
250 hours of telephone conversation in Levantine Arabic. These calls were collected
between 2003 and 2005. Corresponding transcriptions may be found in LDC2006T07. *Data*
This corpus is the combination of four former training data sets: LDC2004E21 and LDC2004E22;
LDC2004E65 and LDC2004E66; LDC2005S07 and LDC2005T03; and LDC2005S14 (Speech and Transcripts).
More than half of the speakers are Lebanese; the others are Jordanian, Palestinian,
and Syrian. The table below shows the distribution of the speakers' national origin:
* 559 Jordanian * 1,853 Lebanese * 355 Palestinian * 67 Syrian * 484 Levantine speakers
whose national origin could not be determined.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S29
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 491-775-257-365-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Levantine Arabic QT Training Data Set 5, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Levantine Arabic QT Training Data Set 5, Transcripts, contains transcripts for approximately
250 hours (1,660 calls) of Levantine Arabic telephone conversation collected between
2003 and 2005. The corresponding speech files may be found in LDC2006S29. This corpus
is the combination of four former training data sets: LDC2004E21 and LDC2004E22; LDC2004E65
and LDC2004E66; LDC2005S07 and LDC2005T03; and LDC2005S14 (Speech and Transcripts).
More than half of the speakers are Lebanese; the others are Jordanian, Palestinian,
and Syrian. The table below shows the distribution of the speakers' national origin:
* 559 Jordanian * 1,853 Lebanese * 355 Palestinian * 67 Syrian * 484 Levantine speakers
whose national origin could not be determined. All transcription files are encoded
in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633801
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 185-835-412-868-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Speech Controlled Computing
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on Speech Controlled Computing, Linguistic Data Consortium
(LDC) catalog number LDC2006S30 and ISBN 1-58563-380-1. The Speech Controlled Computing
corpus was designed to support the development of small footprint, embedded ASR applications
in the domain of voice control for the home. It consists of the recordings of 125
speakers of American English from four dialect regions, three age groups and two gender
groups, pronouncing isolated words. The four primary dialect regions covered by the
corpus are North, South, West and Midland as defined by William Labov's Atlas of
North American English. The three primary age groups covered by the corpus are 18-29,
30-49 and 50+. The recordings were conducted in a sound-attenuated room at LDC with
the AKG C4000B studio condenser microphone. The omni-directional mode of the C4000B
was used. Each speaker read a randomized word list consisting of 2,100 words (100
distinct words appearing 21 times each). Speech utterances were digitized and recorded
to a DAT, as well as to a hard disk drive via the Townshend DATLINK+ digital audio
interface. Speech utterances were audited as they were recorded, and any utterances
detected by the recorder that were not spoken clearly or correctly were re-recorded.
This included extraneous clicks, coughs, sighs and breathing that may have corrupted
the recorded words. Utterances that were spoken too softly or too loudly were also re-recorded.
The digitized utterances were automatically segmented and aligned to the word list.
Then each utterance was audited and the segmentation was checked, and corrected if
necessary, by an annotator using an auditing and segmenting tool developed by LDC.
Finally, sound files containing individual utterances were generated using the alignment
and segmentation information. The sound files for this corpus were created with 100
msec of silent time before and after each utterance. Any files that contained noticeable
clipping were automatically removed.
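As a rough illustration of the prompt list described in this summary (100 distinct words, each appearing 21 times, in random order), the following Python sketch builds such a list. The function name, the seed, and the placeholder vocabulary are assumptions made for the example; the actual word list and randomization procedure used for the corpus are not given in this record.

```python
import random
from collections import Counter
from typing import List


def build_prompt_list(words: List[str], repetitions: int = 21, seed: int = 0) -> List[str]:
    """Repeat every word `repetitions` times and shuffle the result."""
    prompts = [word for word in words for _ in range(repetitions)]
    random.Random(seed).shuffle(prompts)  # seeded so the order is reproducible
    return prompts


# Placeholder vocabulary; the real 100 words are not listed in this record.
words = [f"word{i:03d}" for i in range(100)]
prompts = build_prompt_list(words)
counts = Counter(prompts)
print(len(prompts), len(counts))
```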
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii O.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u kor d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 365-025-522-700-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Treebank Annotations Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Korean Treebank Annotations Version 2.0 is an extension of the Korean English
Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus
of Korean texts annotated with morphological and syntactic information. The original
texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published
by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news
articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the
March 2000 portion of the corpus and includes 647 articles. The annotated corpus can
find many uses, including training of morphological analyzers, part-of-speech taggers
and syntactic parsers. The text is encoded as KSC-5601 (EUC-KR). Version 1.1 of the
treebank is included in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
Korea
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ryu, Shijong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chae, Sook-Hee
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yang, Seung-yun
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Seunghun
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633828
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 971-561-706-841-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Spelled and Spoken Words
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Spelled and Spoken Words Corpus, Linguistic
Data Consortium (LDC) catalog number LDC2006S15 and ISBN 1-58563-382-8. The Spelled
and Spoken Words corpus consists of spelled and spoken words. 3,647 callers were prompted
to say and spell their first and last names, to say what city they grew up in and
what city they were calling from, and to answer two yes/no questions. In order to
collect sufficient instances of each letter, 1,371 callers also recited the English
alphabet with pauses between the letters. Each call was transcribed by two people,
and all differences were resolved. In addition, a subset of 2,648 calls has been phonetically
labeled.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Pronunciation.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fanty, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Roginski, K.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635456
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 672-454-108-628-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Speaker Recognition Version 1.1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the CSLU Speaker Recognition Corpus, Version 1.1,
Linguistic Data Consortium (LDC) catalog number LDC2006S26 and ISBN 1-58563-382-8.
The Speaker Recognition corpus (formerly known as Speaker Verification) consists
of telephone speech from 91 participants. Each participant has recorded speech in
twelve sessions over a two-year period answering questions like "what is your eye
color" or responding to prompts like "describe a typical day in your life." Most of
the utterances in the release of the corpus have corresponding non-time-aligned word
level transcriptions. In most of the CSLU data collections, each participant calls
a toll-free telephone number and answers a few questions. CSLU records the speech,
transcribes it, then packages it as a released corpus. The Speaker Recognition data
collection was quite a bit more complicated. The goal of the data collection was to
collect speech from each participant over a two-year period. Each participant called
the data collection system 12 times over the two-year period and said the same
utterances each time. Some of the recording sessions were only a few days apart and
others several weeks apart. Participants followed this calling schedule. During
the first month, they called twice in a week. No calls were made in the second and
third months. In the fourth month they made one call. No calls were made in the fifth
and sixth months. This pattern repeated three more times for a total of 12 calls per
participant. In order to balance the workload required to remind participants to call
and to avoid large data collection bursts on the system, the participants were divided
into 12 groups. The groups began the two-year schedule in successive months. The first
group started in September 1996. The second group started in October 1996. And so
on.
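The calling schedule in this summary can be checked with a short Python sketch. The function names are hypothetical, and the start month for groups after the second is an assumption: it reads "and so on" as one additional month per group, counting from September 1996.

```python
from typing import List, Tuple


def call_schedule(cycles: int = 4) -> List[Tuple[int, int]]:
    """Return (month_of_study, calls_in_that_month) pairs for one participant.

    Each six-month cycle has two calls in its first month and one call in
    its fourth month; the other months have no calls.
    """
    schedule = []
    for cycle in range(cycles):
        base = cycle * 6
        schedule.append((base + 1, 2))
        schedule.append((base + 4, 1))
    return schedule


def group_start(group: int) -> Tuple[int, int]:
    """Return (year, month) in which a group (1 to 12) begins its schedule.

    Assumes group 1 starts in September 1996 and each later group starts
    one month after the previous one.
    """
    months_since_jan_1996 = 8 + (group - 1)  # September is month index 8
    return 1996 + months_since_jan_1996 // 12, months_since_jan_1996 % 12 + 1


total_calls = sum(calls for _, calls in call_schedule())
print(total_calls, call_schedule()[-1], group_start(12))
```

Under these assumptions the four cycles span 24 months and give 12 calls per participant, matching the totals stated in the summary.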
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u tam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634646
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 054-578-209-297-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
An English Dictionary of the Tamil Verb
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
An English Dictionary of the Tamil Verb represents over twenty-five years of work
led by Harold F. Schiffman, Professor Emeritus of Dravidian Linguistics and Culture
at the University of Pennsylvania's Department of South Asia Studies. It contains
translations for 6597 English verbs and defines 9716 Tamil verbs. This release presents
the dictionary in two formats: Adobe PDF and XML. The PDF format displays the dictionary
in a human readable form and is suitable for printing. The XML version is a purely
electronic form which, while readable by humans, is intended mainly for application
development and the creation of searchable electronic databases. In the electronic
XML version each entry contains the following: the English entry or head word; the
Tamil equivalent (in Tamil script and transliteration); the verb class and transitivity
specification; the spoken Tamil pronunciation (audio files in mp3 format); the English
definition(s); additional Tamil entries (if applicable); example sentences or phrases
in Literary Tamil, Spoken Tamil (with a corresponding audio file in .mp3 format) and
an English translation; and Tamil synonyms or near-synonyms, where appropriate. Some
foods referenced in the example sentences are illustrated in HTML files that include
a detailed description of each dish. It is expected that the dictionary will be useful
for Tamil learners, scholars and others interested in the Tamil language. *The Tamil
Verb* Tamil is an official language of India, Singapore and Sri Lanka and has roughly
66 million native speakers worldwide. Most Tamil speakers live in the Tamil Nadu State
of India and northeastern Sri Lanka, but the extended diaspora includes Malaysia,
Mauritius and Singapore. Tamil is also a Classical Language of India. A member of
the Dravidian language family, it boasts a rich literary tradition stretching back
over 2200 years. Tamil is a diglossic language, meaning that it consists of at least
two distinct forms. Spoken Tamil (ST) refers to the numerous vernacular dialects,
and Literary Tamil (LT) refers to the form of the language used in print and most
broadcast news media. The dialects of Spoken Tamil fall along regional divisions and
along caste lines; there is no widely adopted standard for ST, although one seems
to be emerging. Educated Tamil speakers, on the other hand, generally use LT with
little variation in written communication. It appears, however, that a common dialect
of ST may be emerging as a result of a growing broadcast media and increased rates
of higher education. That dialect resembles the upper caste (non-Brahman) dialects
spoken in the urban centers of Tamil Nadu and borrows verbs from LT. The spoken examples
in An English Dictionary of the Tamil Verb reflect this emerging common dialect. Tamil
is also an agglutinative language, meaning that it constructs verbs by appending inflections
in the form of suffixes onto basic verb-stem morphemes. These inflections primarily
denote tense, aspect, voice and mood. As far as voice is concerned, however, though
LT may mark a verb as passive on occasion, ST rarely makes this distinction. These
suffixes mark whether a verb is transitive or intransitive, that is, they indicate
whether the subject is acted on by the verb or whether the subject is the actor. Mood
is implied by verb tense, but may also be provided by verbal auxiliaries expressing
various degrees of probability, futurity, ability, and their negatives. Tamil also
may add suffixes that mark aspectual distinctions, such as whether an action is considered
to be perfective ('complete' and/or 'definite') or whether it is ongoing or imperfective
('continuous' or 'durative'), as well as other distinctions. Aspect is a category
that is undergoing increasing grammaticalization and is therefore more usual in ST
than in LT. As is common with this process, aspectual distinctions are 'speaker-centered',
i.e. they provide personal observations (some analysts have referred to this as 'attitude'
or 'point of view', which is of course what the word 'aspect' originally means) which
describe the speaker's frame of mind concerning the event depicted in the sentence
-- whether it is perceived to be beneficial or detrimental, positive or negative,
voluntary or involuntary, etc. Aspectual distinctions vary widely among dialects,
both because of the variability of the grammaticalization process and for historical
reasons, and Tamil speakers can code-switch among different dialects depending on
context and audience. In addition, ST and LT treat the grammar of verbs differently.
Finding exact equivalents between English and Tamil verbs is very difficult as a result
of Tamil's diglossic nature and because of the difficulty of mapping English aspectual
distinctions onto Tamil aspectual categories. An English Dictionary of the Tamil Verb
seeks to meet needs not currently addressed by existing English-Tamil dictionaries.
The main goal of this dictionary is to get an English-knowing user to a Tamil verb,
irrespective of whether he or she begins with an English verb or some other item,
such as an adjective; this is because what may be a verb in Tamil may in fact not
be a verb in English, and vice versa. Since the number of English entries is limited
(slightly fewer than 10,000), there may not be main entries for certain low-frequency
items like 'pounce', but this item does appear as a synonym for 'jump, leap' and some
other verbs, so searching for 'pounce' will get the user to a Tamil verb via the synonym
field. The main goal is therefore to specifically concentrate on supplying the kinds
of information lacking in all previous attempts to capture the equivalencies between
English and Tamil. *Data* The text in the XML version of the dictionary is encoded in UTF-8.
A DTD and a W3C Schema have been provided for validation. In addition, an example XSLT
style sheet has been provided to assist the novice in XML transformations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Tamil and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Tamil language
- Form subdivision:
Dictionaries.
- General subdivision:
Verb
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Tamil language
- Form subdivision:
Dictionaries
- General subdivision:
English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schiffman, Harold
ADDED ENTRY--PERSONAL NAME
- Personal name:
Renganathan, Vasu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u rus d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633887
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S34
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 301-264-944-856-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Russian through Switched Telephone Network (RuSTeN)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S34
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the Russian through Switched Telephone Network
(RuSTeN), Linguistic Data Consortium (LDC) catalog number LDC2006S34 and ISBN 1-58563-388-7.
This corpus was developed as part of "Trawl" (Automatic Voice Identification System
in Telephone Channel). The purpose of the project was to develop software for automatic
identification of speakers based on voice samples acquired through telephone channels.
The training of the system was performed with the telephone speech corpus RuSTeN.
*Data* The RuSTeN (Russian through Switched Telephone Network) database was recorded
between March 2001 and February 2003 by Speech Technology Center using the "forget-me-not"
professional telephone recording and archiving software package developed by STC.
The files were recorded at a sampling rate of 11025 Hz, single-channel, 16-bit linear.
Each of the speakers made at least five calls from different locations and/or telephone
sets. Most of the calls were made from home or an office environment with uncontrolled
noise level. Additionally, one call per speaker was made from a public telephone (with
either street or metro station noise in the background). The recordings are spontaneous
(sometimes guided by the near-end speaker) conversations between the caller and the
speech database collector on various subjects (the weather, the caller's biography,
hobbies, etc.) and include approximately 150 seconds of the far-end and at least five
seconds of the near-end speaker. In addition, during each call the caller was asked to utter
the digits 0-9 and the words "yes" and "no." The time interval between
two successive sessions is at least two days. The database contains 125 speakers (far-end),
58 male and 67 female. Each far-end speaker is represented by at least five speech
files. The sound files are in WAV format. The speech filenames contain the following
information: FFF (far-end speaker number) and SS (session number).
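The recording parameters above imply a rough storage size per session. The following is a minimal sketch of that arithmetic, assuming uncompressed PCM and ignoring WAV headers; the function and variable names are illustrative and not part of the corpus documentation, and the corpus figure is only a lower-bound estimate based on the stated minimum of five sessions per speaker.

```python
# Figures taken from the summary above; all names are illustrative.
SAMPLE_RATE_HZ = 11025   # sampling rate stated in the summary
BYTES_PER_SAMPLE = 2     # 16-bit linear PCM
CHANNELS = 1             # single channel


def pcm_bytes(seconds: float) -> int:
    """Return the size in bytes of uncompressed PCM audio of the given length."""
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# One session: roughly 150 s of far-end speech (WAV header not counted).
session_bytes = pcm_bytes(150)

# Assumed lower bound for the corpus: 125 speakers x 5 sessions each.
corpus_bytes = 125 * 5 * session_bytes

print(session_bytes)
print(corpus_bytes)
```

Under these assumptions a single session is about 3.3 MB and the corpus is at least about 2 GB; the actual release size may differ.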
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Russian. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Russian language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Raev, Anrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Koval, Serguei
ADDED ENTRY--PERSONAL NAME
- Personal name:
Smirnova, Natalia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Khitrova, Daria
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stepanov, Vitaly
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S34
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633607
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S36
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 875-309-087-531-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Korean Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S36
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on West Point Korean Speech, Linguistic Data Consortium
(LDC) catalog number LDC2006S36 and ISBN 1-58563-360-7. West Point Korean Speech is
a database of digital recordings of spoken Korean. Corpus design and data collection
were carried out by staff and faculty of the Department of Foreign Languages (DFL)
and Center for Technology Enhanced Language Learning (CTELL), located at the United
States Military Academy (USMA), West Point, New York. The corpus was designed to develop
speech recognition systems that would be used by the US government for speech-recognition
enhanced language learning courseware. The prompt scripts were created from 20,000
distinct sentences, along with a subset of prompts designed to elicit free response
answers to questions for use in domain-specific speech-to-speech translation systems.
Each speaker attempted to record 100 utterances. Three data collection scripts were
designed by Ms. Jennifer Son, a native speaker of Korean, under contract with the
Department of Foreign Languages.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Spoken Korean
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Geographic subdivision:
United States.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, John
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S36
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u vie d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633909
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S35
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 871-936-811-171-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Multilanguage Telephone Speech Version 1.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S35
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages:
English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil,
Vietnamese. The corpus contains fixed vocabulary utterances (e.g., days of the week)
as well as fluent continuous speech. The current release includes recorded utterances
from about 2,052 speakers, for a total of about 38.5 hours of speech. Time-aligned
phonetic transcriptions for 619 of the utterances are also included. *Data* Each subject
called the CSLU data collection system by dialing a toll-free number. An analog telephone
line was connected to a Gradient Technologies box. Data from incoming calls were recorded
by the Gradient box. The sampling rate was 8 kHz and the files were stored in 16-bit
linear format on a UNIX file system. Each utterance was recorded as a separate file.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French,
English, German, and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Yeshwant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oshika, Beatrice
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S35
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u dut d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633445
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 632-458-830-271-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
dut
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
nld
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
N4 NATO Native and Non-Native Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation on the N4 NATO Native and Non-Native Speech Corpus,
Linguistic Data Consortium (LDC) catalog number LDC2006S13 and ISBN 1-58563-344-5.
The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research
group on Speech and Language Technology in order to provide a military-oriented database
for multilingual and non-native speech processing studies. Speech data was recorded
in the naval transmission training centers of four countries (Germany, The Netherlands,
United Kingdom, and Canada). The material consists of native and non-native speakers
using NATO English procedure between ships and reading from a text, "The
North Wind and the Sun," in both English and the speaker's native language. Speech
technology is covering an increasing number of languages, and systems are becoming
more robust with regard to speech variability such as speaking style and accents. However,
for real applications, especially in a multilingual and multinational context, further
robustness to regional and even non-native accents is necessary. Among the numerous corpora
available for speech research, few have specifically addressed this issue. The NATO
Speech and Language Technology group decided to create a corpus geared towards the
study of non-native accents. The group chose naval communications as the common task
because it naturally includes a great deal of non-native speech and because there
were training facilities where data could be collected in several countries. *Data*
The database was collected in four countries (Germany, The Netherlands, United Kingdom,
and Canada) during naval communication training sessions in 2000-2002. For each country,
the main part of the recordings consists of a NATO Naval procedure in English where
the typical sentence sounds like "This is alpha, whiskey, roger. I make two seven
zero six hostile, two seven zero six. Out." In addition each speaker read a text,
"The North Wind and the Sun," in English and his or her native language. The audio
material was recorded on DAT and downsampled to 16 kHz, 16-bit, and all the audio files
have been manually transcribed and annotated with speaker identities using the Transcriber
tool. Navy procedure recordings and text readings have been stored in different
files. The first digit in the filename indicates the type of speech. Among speech segments,
the duration of Navy procedure recordings ranges from 1.3h to 2.3h for a total of 7.5h.
The duration of the native language text readings ranges from 1.5min to 22.9min for
a total of around one hour.

Durations in hours:

              CA     GE     NL     UK     All
Signal       5.30   3.20   5.00   6.30   19.80
Silence      3.00   0.56   2.00   4.70   10.26
Speech       2.30   2.64   3.00   1.60    9.54
Navy proc    2.00   1.90   2.30   1.30    7.50
Read text    0.30   0.74   0.70   0.30    2.04
Non-native   0.27   0.37   0.32   0.00    0.96
Native       0.03   0.37   0.38   0.30    1.08

The database contains the following information about each speaker: gender,
age, weight, length, possible speaking or hearing disorders, education level, living
area, accent, second language, and the year English was learned (for non-native speakers).
The speaker accents vary widely from country to country. The speakers' average age
was 22.6 years. Nineteen women participated, accounting for 18% of the study participants.
There were a total of 115 speakers.

              CA      GE      NL      UK      All
#Speakers     22      51      31      11      115
#Women         5       0       9       5       19
Age        22-35   17-23   17-61   19-62    17-62
Age mean    28.3    20.1    21      27.5     22.6
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Dutch, English, and German. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Cross-language information retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
German language
- Form subdivision:
Databases.
- General subdivision:
Spoken German
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Dutch language
- Form subdivision:
Databases.
- General subdivision:
Spoken Dutch
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grieco, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Benarousse, Laurent
ADDED ENTRY--PERSONAL NAME
- Personal name:
Geoffrois, Edouard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Series, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Steeneken, Herman
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stumpf, Hans
ADDED ENTRY--PERSONAL NAME
- Personal name:
Swail, Carl
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thiel, Dieter
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u vie d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S31
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 610-601-655-546-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2003 NIST Language Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S31
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The goal of the NIST Language Recognition Evaluation (LRE) is to establish the baseline
of current performance capability for language recognition of conversational telephone
speech and to lay the groundwork for further research efforts in the field. The series
had its first evaluation in 1996. 2003 NIST Language Recognition Evaluation (LRE-03)
was part of this ongoing series of evaluations of language recognition technology.
Further information regarding this evaluation may be found on the 2003 NIST Language
Recognition Evaluation website and in the NIST 2003 evaluation plan. The task evaluated
was the detection of a given target language. Given a test segment of speech, a target
language was assigned as a test hypothesis, and the task was to determine whether
this test hypothesis was true or false. This release contains both the 1996 and 2003
NIST Language Recognition Evaluations. *Data* Each speech file is one side of a "four
wire" telephone conversation represented as 8-bit, 8 kHz mu-law data. There are 11,830
speech files in SPHERE (.sph) format for a total of around forty-six hours of speech.
The speech data was compiled from the LDC's CALLFRIEND, CALLHOME, and Switchboard-2
corpora. Each file contains one test segment. The test segments are divided into three-second,
ten-second, and thirty-second tests, each in its own directory.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French,
English, German, Mandarin Chinese, and Egyptian Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S31
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633844
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S33
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 461-254-833-604-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Middle East Technical University Turkish Microphone Speech v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S33
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Middle East Technical University Turkish Microphone Speech v 1.0 was developed at
Middle East Technical University (METU) as part of a collaborative work between METU's
Department of Electrical and Electronics Engineering and the Center for Spoken Language
Research (CSLR) at the University of Colorado at Boulder. The collaboration was supported
by TUBITAK, the Scientific and Technical Research Council of Turkey, through a combined
doctoral scholarship program. The corpus was used to port CSLR's speech recognition
system, SONIC, to Turkish. The corpus contains text, speech and alignment files. The
corpus is approximately 600 MB in size. 120 speakers (60 male and 60 female) speak 40 sentences
each (approximately 300 words per speaker), for a total of approximately 500 minutes of
speech in total. The 40 sentences are selected randomly for each speaker from a triphone-balanced
set of 2,462 Turkish sentences. The speakers are selected from students, faculty and
staff at METU and all are native speakers of Turkish. The age range is from 19 to
50 years with an average of 23.9 years. The data has been digitally recorded with
a Sound Blaster sound card on a PC at a 16 kHz sampling rate.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Turkish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Turkish language
- Form subdivision:
Databases.
- General subdivision:
Spoken Turkish
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Salor, Ozgul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ciloglu, Tolga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pellom, Bryan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Demirekler, Mubeccel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S33
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2005 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633518
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2005S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 964-004-555-226-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HKUST Mandarin Telephone Speech, Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2005]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2005S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science
and Technology (HKUST). In 2004, HKUST was contracted to collect and transcribe 200
hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in
mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts
were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST
partitioned the remaining 150 hours of collection into training, development and evaluation
sets. This release contains the training and development sets with 873 and 24 calls,
respectively. *Data Collection* Subject recruitment was done in several cities across
mainland China. Most subjects did not previously know each other. To encourage more
meaningful conversation, topics similar to those in Fisher English were designed.
All calls were operator-assisted, namely, an operator would call two participants
as scheduled to initiate a call. Subjects were asked demographic questions before
they were bridged for normal conversation. Their answers to the demographic questions
were recorded in separate files. Subjects were allowed to talk for up to 10 minutes. With
a few exceptions, most calls are of the maximum length. Although subjects were allowed
to make up to three calls, all subjects made just one call in this release with one
exception, where PIN 10683 and PIN 10686 belong to a single individual. Each side
of a call was recorded in a separate .wav file, sampled at 8 kHz with 8-bit A-law encoding.
The two sides were later multiplexed in SPHERE format with the A-law encoding preserved.
In the case where one side was shorter than the other, the shorter side was padded
with silence. In the release, the file name of each recorded call is in the format
of date_time_Apin_Bpin.sph and the corresponding transcript is in the same format
with a .txt extension. *Speaker demographics* Subjects were asked to provide several
pieces of demographic information, including gender, age, native language/dialect,
birthplace, education, occupation, phone type, etc. Given that Standard Mandarin is
not the native dialect in many regions of China but is the official language of education
and speakers may or may not have regional accents speaking Mandarin, it was decided
that subjects' birthplaces would be divided into Mandarin-dominant and non-Mandarin-dominant
regions and that all calls would be audited and classified into standard and accented types
without further distinctions. Selected demographics - age, gender, birthplace, phone
type and accent for each side of the call and the topic ID for the call - are provided
as a tab-delimited, plain-text, tabular file.
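The file naming pattern described above (date_time_Apin_Bpin.sph, with a matching .txt transcript) could be parsed along the following lines. This is only a minimal sketch in Python: the function names are hypothetical, the example file name is invented for illustration, and the exact width and layout of the date, time and PIN fields are assumptions not confirmed by this record.

```python
from pathlib import PurePosixPath


def parse_call_filename(name):
    """Split a name of the form date_time_Apin_Bpin.sph into its parts.

    Assumption: the four fields are separated by single underscores and the
    two PIN fields carry a literal 'A' or 'B' prefix, as the pattern suggests.
    """
    stem = PurePosixPath(name).stem
    date, time, a_side, b_side = stem.split("_")  # raises ValueError if not 4 fields
    if not (a_side.startswith("A") and b_side.startswith("B")):
        raise ValueError(f"unexpected side labels in {name!r}")
    return {"date": date, "time": time, "pin_a": a_side[1:], "pin_b": b_side[1:]}


def transcript_name(name):
    """Return the transcript file name: same stem, .txt extension."""
    return PurePosixPath(name).with_suffix(".txt").name


# Invented example name, not taken from the corpus.
example = "20040101_120000_A00001_B00002.sph"
print(parse_call_filename(example))
print(transcript_name(example))
```

Keeping the PINs as strings avoids losing any leading zeros, should the real identifiers have them.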
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fung, Pascale
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2005S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634069
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 509-623-192-870-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Levantine Arabic Conversational Telephone Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database contains 982 Levantine Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Levantine Arabic. A total of 985 conversation sides are
provided (there are three speakers who each appear in two distinct conversations).
The average duration per side is between 5 and 6 minutes. This corpus was collected
and transcribed in 2004 by Appen Pty Ltd, Sydney, Australia.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Syria
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Lebanon
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634077
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 028-999-912-000-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Levantine Arabic Conversational Telephone Speech, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database contains 982 Levantine Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Levantine Arabic. A total of 985 conversation sides are
provided (there are three speakers who each appear in two distinct conversations).
The average duration per side is between 5 and 6 minutes. This corpus was collected
and transcribed in 2004 by Appen Pty Ltd, Sydney, Australia.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Syria
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Lebanon
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634085
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 877-578-293-641-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Chinese Translation Treebank v 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release of English Chinese Translation Treebank v. 1.0 consists of 146,300 words
in 325 files of individual news stories from Xinhua News Agency (corresponding to
the Xinhua data in Chinese Treebank 5.0 LDC2005T01) that are translated into English,
part-of-speech tagged and treebanked. The files were compressed using gzip. The source
files for the treebank annotation contain the final updated translation of these files.
Translation errors that prevented complete treebank annotation have been corrected.
This translation and annotation were completed in October 2004 and supersede any earlier
translation. This publication was compiled under National Science Foundation Grant
#IIS-0325646.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mott, Justin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warner, Colin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634093
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 614-675-002-053-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Tagged Chinese Gigaword
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is
the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition LDC2005T14.
It contains all of the data in Chinese Gigaword Second Edition -- from Central News
Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part
of speech tags. In order to avoid any problems or confusion that could result from
differences in character-set specifications in the source data, all text files in
this corpus have been converted to UTF-8 character encoding. With some exceptions
described in the readme file, all characters in the text are either single-byte ASCII
or multi-byte Chinese. All sources have been categorized into four distinct "types":
* story: This type of DOC represents a coherent report on a particular topic or event,
consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly
describes a particular topic or event; examples include "summaries of today's news,"
"news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs which the news service addresses to news editors; they are not
intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; they include items
such as lists of sports scores, stock prices, temperatures around the world, and so on.
*Data* The table below lists the number of files, their compressed and uncompressed
size, number of words and number of documents divided by source. #Files = number of
files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes.
K-wrds = number of words in thousands. #DOCs = number of documents.
Source   #Files  Rzip-MB  Totl-MB   K-wrds    #DOCs
CNA_CMN     168      994     7363   792195  1769953
XIN_CMN     168      615     4535   471110   992261
ZBN_CMN      10       40      223    28066    41418
TOTAL       346     1648    12121  1291371  2803632
The following tables present the quantity of "K-wrds" and "#DOCs", divided by source and
DOC type:
type="advis"     #DOCs   K-wrds
CNA_CMN           8160      751
XIN_CMN           6553      711
ZBN_CMN              0        0
TOTAL            14713     1462
type="multi"     #DOCs   K-wrds
CNA_CMN          30552    23429
XIN_CMN          11329     7516
ZBN_CMN             55       41
TOTAL            41936    30986
type="other"     #DOCs   K-wrds
CNA_CMN         100758    40258
XIN_CMN          31255     9999
ZBN_CMN            279      130
TOTAL           132292    50387
type="story"     #DOCs   K-wrds
CNA_CMN        1630483   727748
XIN_CMN         943132   452878
ZBN_CMN          41084    27898
TOTAL          2614691  1208524
The performance of the CKIP Segmentation and POS tagging system has been tested in Bakeoff
2005 and Bakeoff 2006. The test results are as follows:
Test          Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
Bakeoff 2005   190    116509     116443      112091        96.2           96.3         96.2
Bakeoff 2006   148     90405      90327       87332        96.6           96.7         96.6
Note: Recall = MatchWord# / RefWord#. Precision = MatchWord# / TestWord#.
F-Score = 2 * Recall * Precision / (Recall + Precision).
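The three formulas in the note above can be checked against the word counts reported for the two Bakeoff tests. The following is a minimal sketch in Python; the function name `segmentation_scores` is hypothetical and is not part of the CKIP system or of the corpus documentation.

```python
def segmentation_scores(ref_words, test_words, match_words):
    """Return (recall, precision, f_score) as fractions between 0 and 1.

    Recall    = MatchWord# / RefWord#
    Precision = MatchWord# / TestWord#
    F-Score   = 2 * Recall * Precision / (Recall + Precision)
    """
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# Word counts as reported in the summary above: (RefWord#, TestWord#, MatchWord#).
reported = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

for label, (ref, test, match) in reported.items():
    r, p, f = segmentation_scores(ref, test, match)
    print(f"{label}: recall={100 * r:.1f}% precision={100 * p:.1f}% f-score={100 * f:.1f}%")
```

Rounded to one decimal place, the computed percentages agree with the Recall, Precision and F-Score figures given in the summary.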
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Chu-Ren
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u por d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633836
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 386-396-917-783-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Spoltech Brazilian Portuguese Version 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: Spoltech Brazilian Portuguese Version 1.0, Linguistic Data Consortium (LDC)
catalog number LDC2006S16 and ISBN 1-58563-383-6, contains microphone speech from
a variety of regions in Brazil with phonetic and orthographic transcriptions. The
utterances consist of both read speech (for phonetic coverage) and responses to questions
(for spontaneous speech). The corpus contains 477 speakers and 8,080 separate utterances.
A total of 2,540 utterances have been transcribed at the word level (without time
alignments), and 5,479 utterances have been transcribed at the phoneme level (with
time alignments). Protocol design, recording and transcription were performed by the
Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul. *Data*
The data has been recorded at 44.1 kHz (mono, 16-bit) and stored in RIFF format. The
recording was conducted with a direct connection from the microphone to the sound
card. The sound card was SoundBlaster-compatible. For the prompted sentences, the
sentence was hidden from view when recording began, so that the speaker might utter
the sentence more naturally. Verification of the recording quality was performed immediately
after each utterance recording; the data-collection software allowed the speaker to
re-record utterances in case the recording was not of sufficient quality. The acoustic
environment was not controlled, in order to allow for background conditions that would
occur in application environments.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Portuguese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Portuguese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Portuguese
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Portuguese language
- Form subdivision:
Databases.
- General subdivision:
Dialects
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schramm, Mauricio C.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Freitas, Luis Felipe R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zanuz, Adriano
ADDED ENTRY--PERSONAL NAME
- Personal name:
Barone, Dante
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634913
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 347-060-360-170-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Academic Corpus 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Prague family of annotated corpora has a new member, the Czech Academic Corpus
version 2.0 (CAC 2.0). CAC 2.0 consists of 650,000 words from various 1970s and 1980s
newspapers, magazines and radio and television broadcast transcripts manually annotated
for morphology and syntax. The CAC 2.0 offers: * For linguists: language material
reflecting the real usage of the language. * For computational linguists: tools and
a considerable amount of data for natural language applications that are not feasible
without morphological and syntactical text processing. * For TrEd annotation tool
users: the possibility to use voice control for the tool. * For teachers and their
students: an interesting didactic tool for practising Czech language morphology and
syntax. The CAC was created by a team from the Institute of the Czech Language, the
Academy of Sciences of the Czech Republic, led by Marie Těšitelová, during the period
from 1971 to 1985. The original purpose of the corpus was to build a frequency dictionary
of the Czech language. Researchers were aware, however, that in order to make the
CAC useful for future users, whether linguists or natural language processing systems
developers, it was necessary to design annotation schemes and to develop tools that
would add as much linguistic information as possible to the data. In 1996, the Prague
Dependency Treebank (PDT), which provided morphological and syntactic analytic layers
of annotation to certain Czech media data, was launched independently of the CAC.
During the work on the PDT's second version, its researchers decided to transfer PDT's
internal format and annotation scheme to the CAC with the goals of making the CAC
and the PDT fully compatible and of integrating the CAC into the PDT. To that end,
the CAC was manually annotated for morphology and syntax. CAC 2.0 adds the surface
syntax annotation; in the terminology of the PDT, this annotation is called an analytical
layer. The following PDT resources are available from LDC: Prague Dependency Treebank
1.0 (LDC2001T10), Prague Dependency Treebank 2.0 (LDC2006T01), Prague Arabic Dependency
Treebank 1.0 (LDC2004T23), and Prague Czech-English Dependency Treebank 1.0. *Annotation
Description and Examples* A morphological layer of annotation provides the word tokens
with further data (annotation), which characterizes the morphological properties of
the word tokens (as apparent in the lemma which is the canonical form of a lexeme),
the part of speech, and morphological categories (case, number, tense, person, etc.).
Formally, part of speech classes combine together with values of morphological categories
to represent morphological tags (or, simply, tags). In the CAC 2.0, tags are designed
according to the PDT as strings of definite length (15 positions) where each position
corresponds to a single category. Example: The word form Prahu (a form of "Prague")
is analysed as an affirmative (11th position) noun (1st and 2nd position), feminine
(3rd position), singular (4th position), and accusative (5th position). All of the
other positions are correctly filled with the symbol "-" that represents the irrelevance
of the morphological category towards the part of speech. For example, one does not
determine a person and tense with nouns (8th and 9th position).
Examples of lemmas and tags of particular word forms:
Word token  Lemma  Tag              Description
Prahu       Praha  NNFS4-----A----  Noun, feminine, singular, accusative, affirmative
123         123    C=-------------  Digit token
)           )      Z:-------------  Punctuation mark (right parenthesis)
An a-layer annotation assigns
each word unit the corresponding data characterising the syntactical features of the
unit and therefore its relation to the other sentence elements along with its sentence
function. Formally, the sentence relations are represented by a dependency tree. Example:
Syntactical annotation of the sentence Obecná odpověď na tuto otázku je sotva možná. (Lit.:
A general response to this question is hardly possible.) Each word unit (word, number,
punctuation mark) is represented by a single node in the resulting tree. Note that
due to technical reasons each tree is rooted by one extra node - the tree in our example
therefore consists of 9 nodes. The annotation approach builds on the tradition of
the Prague linguistic school, where the predicate (usually verb) is understood to
be the centre of the sentence. Therefore the predicate is placed as a direct daughter
of the root. The final punctuation is also placed as a daughter of the root node.
Two constituents of the sentence are dependent on the predicate - odpověď (answer)
and možná (possible). Please note that each node in the tree is annotated with the
word form, lemma, morphological tag and analytic function. Looking at the node representing
the word odpověď (answer), we can see its form is a feminine noun in nominative singular
and that this unit stands in the role of subject of the sentence, which is expressed
by the analytic function Subj. [Figure: Example of an a-layer annotation] The conception of
the main internal format of the CAC 2.0 treats the annotation layers separately where
each layer of annotation in the document corresponds to one file. (In the case of
the CSTS format, all layers of annotation are contained in one file.) This relationship
in the CAC 2.0 means that there are three instances (files) for every document, one
for the w-layer, one for the m-layer and a third one for the a-layer. However, the
distinction between layers does not restrict interconnection between groups for particular
layers of annotation. In fact, the opposite is true as will be demonstrated later
in this section. The word layer does not reflect the segmentation of the text into
sentences; this segmentation occurs on the m-layer. This means that unlike the w-layer,
the m-layer contains final punctuation. Additionally, the number of word tokens in
both layers may differ. The differences arise either from joining an incorrectly
split word into one word or, conversely, from dividing incorrectly connected
words into several units. The correctly written text should be contained in the m-layer.
Example: The three following figures illustrate the w-layer and m-layer interconnection.
Also the interconnection of the files in the sense of the number of word units is
captured and denoted by arrows. All three examples were chosen from the CAC 2.0 deliberately
so that the user can directly view the instances; the name of the document and number
of the sentence is provided for every sentence. Figure 2.2 serves to illustrate the
1:1 ratio of the layers. The layers do not differ except for the final punctuation.
Figure 2.3 exemplifies the situation where a word token is inserted into the text - the year
information was clearly missing. Since it is almost impossible for the corrector to
add the missing year, the symbol "#" is used; this symbol has no counterpart on
the w-layer. In contrast, Figure 2.4 illustrates the situation where several m-layer
units correspond to the same w-layer unit - the word unit pedagogicko-psychologické
(E: psychological-pedagogical) has been divided into three separate units.
[Figure 2.2. Technical interconnection of the w-layer and m-layer: No changes other than the final-sentence punctuation]
[Figure 2.3. Technical interconnection of the w-layer and m-layer: The insertion of a word token]
[Figure 2.4. Technical interconnection of the w-layer and m-layer: The division of a word token]
The interconnection between the a-layer and
m-layer means that each m-layer word unit corresponds exactly to one node of the dependency
tree on the a-layer, and vice versa. The only exception is the technical root, which
has no counterpart on the m-layer. *Corpus Tools* CAC 2.0 contains the following tools:
* Bonito: a corpus manager that searches CAC 2.0 texts. * LAW: a morphological annotations
editor. * TrEd: a syntactical annotations editor. * Negraph: a corpus viewer. * tool_chain:
automatically processes Czech texts.
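The positional tags described in this summary (15 characters, one category per position, "-" for an irrelevant category) can be unpacked with a few lines of Python. The following is a minimal sketch, not the CAC or PDT tooling: the position numbers are the ones named in the summary, while the category labels (`part_of_speech`, `part_of_speech_detail`, `negation`, and so on) and the function name `decode_tag` are assumptions introduced for illustration. Only values that appear in the worked example are translated; anything else is returned as the raw symbol.

```python
TAG_LENGTH = 15

# 1-based positions mentioned in the summary. The labels are illustrative
# assumptions; positions not mentioned there get a generic "position_N" label.
POSITION_NAMES = {
    1: "part_of_speech",
    2: "part_of_speech_detail",
    3: "gender",
    4: "number",
    5: "case",
    8: "person",
    9: "tense",
    11: "negation",
}

# Only the values confirmed by the "Prahu" example are given readable names.
KNOWN_VALUES = {
    ("gender", "F"): "feminine",
    ("number", "S"): "singular",
    ("case", "4"): "accusative",
    ("negation", "A"): "affirmative",
}


def decode_tag(tag):
    """Map a 15-position tag to {category: value}, skipping '-' positions."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    decoded = {}
    for position, symbol in enumerate(tag, start=1):
        if symbol == "-":
            continue  # category is irrelevant for this part of speech
        name = POSITION_NAMES.get(position, f"position_{position}")
        decoded[name] = KNOWN_VALUES.get((name, symbol), symbol)
    return decoded


if __name__ == "__main__":
    for example in ("NNFS4-----A----", "C=-------------", "Z:-------------"):
        print(example, decode_tag(example))
```

Keeping unknown symbols as raw characters avoids guessing at the parts of the tag set that this record does not spell out.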
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Morphology
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
- General subdivision:
Syntax
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hladká, Barbora Vidová
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hana, Jiří
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hlaváčová, Jaroslava
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mírovský, Jiří
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633984
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S42
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 762-674-512-341-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S42
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This data set consists of 18 audio files recorded by LDC in January 2000 and February
2000 from Voice of America (VOA) satellite radio news broadcasts in Korean. *Data*
The recordings, captured from a dedicated satellite receiver, are stored as 16-bit
PCM, 16-kHz, single-channel, in NIST SPHERE format. The duration of each recording
is either 30 minutes or 60 minutes, depending on the VOA broadcast schedule. The date
(YYYYMMDD), start-time and end-time (HHMM, Eastern Standard Time) for each recording
are indicated in its file name. The sample data is not compressed. Transcripts for
these recordings are available as a separate corpus from the LDC: Korean Broadcast
News Transcripts, LDC2006T14.
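Because the sample data is uncompressed 16-bit, 16-kHz, single-channel PCM, the approximate size of each recording follows directly from its duration. The sketch below works through that arithmetic; the function name `pcm_payload_bytes` is hypothetical, and the result covers the audio samples only, not the SPHERE file header.

```python
def pcm_payload_bytes(minutes, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration.

    Defaults follow the format stated in the summary; the file header
    is not included in the result.
    """
    seconds = minutes * 60
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    for minutes in (30, 60):
        size = pcm_payload_bytes(minutes)
        print(f"{minutes} min: {size:,} bytes ({size / 1_000_000:.1f} MB)")
```

Under these assumptions a 30-minute recording holds 57,600,000 bytes of samples and a 60-minute recording holds 115,200,000 bytes.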
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Spoken Korean
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
- Geographic subdivision:
Korea
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S42
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633976
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 831-344-220-094-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Web 1T 5-gram Version 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Web 1T 5-gram Version 1, contributed by Google Inc., contains English word n-grams
and their observed frequency counts. The length of the n-grams ranges from unigrams
(single words) to five-grams. This data is expected to be useful for statistical language
modeling, e.g., for machine translation or speech recognition, as well as for other
uses. *Source Data* The n-gram counts were generated from approximately 1 trillion
word tokens of text from publicly accessible Web pages. *Character Encoding* The input
encoding of documents was automatically detected, and all text was converted to UTF8.
*Tokenization* The data was tokenized in a manner similar to the tokenization of the
Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
* Hyphenated words are usually separated, and hyphenated numbers usually form one token.
* Sequences of numbers separated by slashes (e.g. in dates) form one token. * Sequences
that look like URLs or email addresses form one token. *Data Sizes*
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
*Sample Data* The following is an example of the 3-gram data contained in this corpus:
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
The following is an example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
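To show how n-gram counts of this kind might feed a simple language-model estimate, the sketch below parses a few of the sample entries and computes the relative frequency of a final word given its preceding words. It assumes each entry is one line whose last whitespace-separated field is the count, which is how the sample reads in this record; the function names (`parse_ngram_line`, `load_counts`, `relative_frequency`) are hypothetical. The estimate is computed only over the lines loaded here, not over the full corpus.

```python
# A handful of entries copied from the sample data in this record.
SAMPLE = """\
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
serve as the initial 5331
serve as the input 1323
"""


def parse_ngram_line(line):
    """Split 'w1 ... wn count' into (tuple of words, integer count)."""
    ngram_text, count_text = line.rsplit(None, 1)
    return tuple(ngram_text.split()), int(count_text)


def load_counts(lines):
    """Build a {ngram tuple: count} dictionary, skipping blank lines."""
    counts = {}
    for line in lines:
        if line.strip():
            ngram, count = parse_ngram_line(line)
            counts[ngram] = count
    return counts


def relative_frequency(counts, ngram):
    """Count of `ngram` divided by the total count of loaded n-grams
    that share the same preceding words (all words but the last)."""
    prefix = ngram[:-1]
    total = sum(c for g, c in counts.items() if len(g) == len(ngram) and g[:-1] == prefix)
    return counts[ngram] / total


if __name__ == "__main__":
    counts = load_counts(SAMPLE.splitlines())
    print(relative_frequency(counts, ("ceramics", "collectables", "fine")))
```

Splitting from the right keeps punctuation tokens such as "," or "|" inside the n-gram rather than mistaking them for separators.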
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Language and languages
- Form subdivision:
Databases.
- General subdivision:
Word frequency
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
- General subdivision:
Statistical methods
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine learning.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brants, Thorsten
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franz, Alex
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585633992
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 030-163-048-673-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This data set consists of 18 text files containing transcripts prepared by the LDC
for Voice of America satellite radio news broadcasts in Korean. The broadcasts were
recorded by the LDC at transmission time during a two-week period between January
21, 2000 and February 7, 2000. *Data* Nine of the broadcasts are 30 minutes long,
and the other nine broadcasts are 60 minutes long. The file names indicate the date
(YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission. The
character encoding is Unicode UTF-8, and the file contents are structured using SGML.
The markup strategy used here was defined by NIST specifically for use in transcripts
of broadcast news speech. The "docs" directory provides a working DTD file, a complete
description (in the form of a PostScript file) of the document structure, tags and
attributes, and a simple text file listing the 18 data file names in the corpus. The
transcripts have been manually time aligned at the phrasal level and annotated to
identify boundaries between news stories and speaker turns; speaker names and gender
are given where identifiable. These annotations are all provided via the SGML tags
and their attributes. A strong effort has been made to identify all unique speakers
across the transcripts. However, there may be cases where an individual speaker has
not been recognized and has been given a unique, anonymous identification. Audio files
for these transcripts are available as a separate corpus from the LDC: LDC2006S42.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
- Geographic subdivision:
Korea
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S43
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 860-289-087-911-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
afb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Gulf Arabic Conversational Telephone Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S43
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database contains 975 Gulf Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Gulf Arabic. A total of 976 conversation sides are provided
(one speaker appears on two distinct calls). The average duration per side is about
5.7 minutes. This corpus was collected and transcribed in 2004 by Appen Pty Ltd (Appen),
Sydney, Australia. *Data* The single-channel files represent just one side of a normal
conversation. The "devtest" set represents a relatively balanced (representative)
sample drawn from the total pool of collected calls, based on a test-set selection
process applied by the National Institute of Standards and Technology (NIST) and based
on demographic, phone and audit information as provided by Appen.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Gulf Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S43
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634026
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S44
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 214-123-995-004-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2004 NIST Speaker Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S44
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 2004 NIST Speaker Recognition evaluation is part of an ongoing series of yearly
evaluations conducted by NIST (National Institute of Standards and Technology). These
evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities. They are intended to be of interest
to all researchers working on the general problem of text-independent speaker recognition.
To this end the evaluation was designed to be simple, to focus on core technology
issues, to be fully supported, and to be accessible. NIST has been coordinating Speaker
Recognition Evaluations since 1996. Each evaluation begins with the announcement of
the official evaluation plan, which clearly states the rules and tasks involved in
the evaluation. The evaluation culminates with a follow-up workshop, where NIST reports
the official results and researchers share their findings. The data consists of
conversational telephone speech collected by the LDC. Additional documentation is
available from the NIST website at http://www.itl.nist.gov/iad/mig/tests/sre/2004/index.html.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S44
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634034
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S45
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 901-237-025-344-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ayp
- Language code of text/sound track or separate title:
afb
- Language code of text/sound track or separate title:
acm
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Iraqi Arabic Conversational Telephone Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S45
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database contains 474 Iraqi Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Iraqi Arabic. A total of 478 conversation sides are provided
(most speakers appear only once), and most of these sides come from calls for which both
sides of the conversation are included (that is, 202 two-channel recordings plus 74
single-channel recordings).
The average duration per call is about 6 minutes, so each call side contains about
3 minutes of speech, on average. This corpus was collected and transcribed in 2003
and 2004 by Appen Pty Ltd, Sydney, Australia.
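The side count quoted in this summary follows directly from the recording counts. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative and are not part of the corpus documentation, and the equal split of call time between the two parties is an assumption.

```python
# Counts quoted in the summary.
two_channel_recordings = 202     # both sides of the call are provided
single_channel_recordings = 74   # only one side of the call is provided

# Each two-channel recording contributes two conversation sides.
conversation_sides = 2 * two_channel_recordings + single_channel_recordings

# Approximate speech per side, assuming the two parties share a
# 6-minute call roughly equally (an assumption, not a corpus statistic).
average_call_minutes = 6
minutes_per_side = average_call_minutes / 2

print(conversation_sides)   # 478
print(minutes_per_side)     # 3.0
```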
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Mesopotamian Arabic, Gulf Arabic, and Mesopotamian Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S45
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634042
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 799-608-899-988-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ayp
- Language code of text/sound track or separate title:
afb
- Language code of text/sound track or separate title:
acm
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Iraqi Arabic Conversational Telephone Speech, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This database contains 474 Iraqi Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Iraqi Arabic. A total of 478 conversation sides are provided
(most speakers appear only once), and most of these sides come from calls for which both
sides of the conversation are included (that is, 202 two-channel recordings plus 74
single-channel recordings).
The average duration per call is about 6 minutes, so each call side contains about
3 minutes of speech, on average. This corpus was collected and transcribed in 2003
and 2004 by Appen Pty Ltd, Sydney, Australia.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Mesopotamian Arabic, Gulf Arabic, and Mesopotamian Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634050
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 351-085-945-382-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
French Gigaword First Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
French Gigaword First Edition is a comprehensive archive of newswire text data that
has been acquired over several years by the Linguistic Data Consortium (LDC) at the
University of Pennsylvania. The two distinct international sources of French newswire
in this edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse (afp_fre), May 1994 - July 2006
* Associated Press French Service (apw_fre), Nov 1994 - July 2006
The seven-character codes in parentheses consist of a three-character source name
abbreviation and the three-character language code ("fre"), separated by an underscore
("_") character. The three-letter language code conforms to LDC's new internal convention
based on the ISO 639-3 standard. The overall totals for each source are summarized below.
Note that the "Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. approximately 4.6 gigabytes, total); the "Gzip-MB" column shows totals
for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the
number of whitespace-separated tokens (of all types), in thousands, after all SGML tags
are eliminated.
Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE      147     1139     3445  482904  1797139
APW_FRE      141      389     1167  167405   622740
TOTAL        288     1528     4612  650309  2419879
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and
DOC type; "Text-MB" represents the total number of characters (including whitespace)
after SGML tags are eliminated.
Source    Text-MB  K-wrds    #DOCs
type="advis":
AFP_FRE        79   10924    47044
APW_FRE         8    1381     6291
TOTAL          87   12305    53335
type="multi":
AFP_FRE        40    5964     6828
APW_FRE       118   18527    29797
TOTAL         158   24491    36625
type="other":
AFP_FRE       169   23723   155571
APW_FRE        72   11006    68429
TOTAL         241   34729   224000
type="story":
AFP_FRE      2848  442284  1587696
APW_FRE       866  136481   518223
TOTAL        3715  578765  2105919
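The source-code convention and the overall totals described in this summary can be checked mechanically. The following Python sketch is illustrative only: the helper name `split_source_code` and the dictionary layout are assumptions, not part of the corpus tooling, and the figures are copied from the summary.

```python
def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_fre' into (source, language)."""
    source, language = code.lower().split("_")
    if len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return source, language


# Per-source totals copied from the summary:
# (files, gzip MB, uncompressed MB, K-words, documents)
totals = {
    "AFP_FRE": (147, 1139, 3445, 482904, 1797139),
    "APW_FRE": (141, 389, 1167, 167405, 622740),
}
reported_total = (288, 1528, 4612, 650309, 2419879)

# Sum each column across the two sources and compare with the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*totals.values()))

print(split_source_code("afp_fre"))     # ('afp', 'fre')
print(column_sums == reported_total)    # True
```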
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in French. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
French language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science).
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634387
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 483-111-978-894-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher Levantine Arabic Conversational Telephone Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Levantine Arabic is spoken along the eastern Mediterranean coast from Anatolia to
the Sinai Peninsula and encompasses the local dialects of Lebanon, Syria and Palestine.
There are two distinct varieties: Northern, centered around Syria and Lebanon; and
Southern, spoken in Jordan and Palestine. Northern Levantine Arabic speakers include
approximately 8.8 million speakers in Syria and 6 million speakers in Lebanon. Southern
Levantine Arabic speakers include approximately 3.5 million speakers in Jordan, 1.6
million speakers in Palestine and nearly one million speakers in Israel. Fisher Levantine
Arabic Conversational Telephone Speech contains 279 telephone conversations totaling
45 hours of speech. The majority of the speakers are from Jordan, Lebanon and Palestine.
Speaker distribution by region: Jordan 60%, Palestine 15%, Lebanon 15%, Syria 8%, other 2%.
The Fisher telephone conversation collection protocol was created at LDC to address
a critical need of developers trying to build robust automatic speech recognition
(ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II
and the resulting corpora, have been adapted for ASR research but were in fact developed
for language and speaker identification respectively. Although the CALLHOME protocol
and corpora were developed to support ASR technology, they feature small numbers of
speakers making telephone calls of relatively long duration with narrow vocabulary
across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a very large number of participants each make a few calls
of short duration speaking to other participants, whom they typically do not know,
about assigned topics. This maximizes inter-speaker variation and vocabulary breadth
although it also increases formality. Previous protocols such as CALLHOME, CALLFRIEND
and Switchboard relied upon participant activity to drive the collection. Fisher is
unique in being platform driven rather than participant driven. Participants who wish
to initiate a call may do so; however the collection platform initiates the majority
of calls. Participants need only answer their phones at the times they specified when
registering for the study. To encourage a broad range of vocabulary, Fisher participants
are asked to speak on an assigned topic which is selected at random from a list, which
changes every 24 hours and which is assigned to all subjects paired on that day. Some
topics are inherited or refined from previous Switchboard studies while others were
developed specifically for the Fisher protocol. *Data* The conversations in this corpus
are a subset of the conversations in Levantine Arabic QT Training Data Set 5, Speech,
LDC2006S29. The individual audio files are in NIST Sphere format. The corresponding
transcripts may be found in Fisher Levantine Arabic Conversational Telephone Speech,
Transcripts, LDC2007T04.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Jordan
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Lebanon
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Palestine
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634115
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 146-188-087-767-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Levantine Arabic is spoken along the eastern Mediterranean coast from Anatolia to
the Sinai Peninsula and encompasses the local dialects of Lebanon, Syria and Palestine.
There are two distinct varieties: Northern, centered around Syria and Lebanon; and
Southern, spoken in Jordan and Palestine. Northern Levantine Arabic speakers include
approximately 8.8 million speakers in Syria and 6 million speakers in Lebanon. Southern
Levantine Arabic speakers include approximately 3.5 million speakers in Jordan, 1.6
million speakers in Palestine and nearly one million speakers in Israel. Fisher Levantine
Arabic Conversational Telephone Speech, Transcripts contains transcripts for 279 telephone
conversations. The majority of the speakers are from Jordan, Lebanon and Palestine.
The corresponding telephone speech is contained in Fisher Levantine Arabic Conversational
Telephone Speech. Speaker distribution by region: Jordan 60%, Palestine 15%, Lebanon
15%, Syria 8%, other 2%. The Fisher telephone conversation collection protocol was created
at LDC to address a critical need of developers trying to build robust automatic speech
recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II
and the resulting corpora, have been adapted for ASR research but were in fact developed
for language and speaker identification respectively. Although the CALLHOME protocol
and corpora were developed to support ASR technology, they feature small numbers of
speakers making telephone calls of relatively long duration with narrow vocabulary
across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a very large number of participants each make a few calls
of short duration speaking to other participants, whom they typically do not know,
about assigned topics. This maximizes inter-speaker variation and vocabulary breadth
although it also increases formality. Previous protocols such as CALLHOME, CALLFRIEND
and Switchboard relied upon participant activity to drive the collection. Fisher is
unique in being platform driven rather than participant driven. Participants who wish
to initiate a call may do so; however, the collection platform initiates the majority
of calls. Participants need only answer their phones at the times they specified when
registering for the study. To encourage a broad range of vocabulary, Fisher participants
are asked to speak on an assigned topic which is selected at random from a list, which
changes every 24 hours and which is assigned to all subjects paired on that day. Some
topics are inherited or refined from previous Switchboard studies while others were
developed specifically for the Fisher protocol. *Data* The transcripts were created
with green and yellow layers using LDC's Multi-Dialectal Transcription Tool (AMADAT).
The green layer seeks to anchor dialectal forms to similar or related Modern Standard
Arabic orthography-based forms. The yellow layer is a more careful and detailed transcription
that adds functionally necessary vowels and marks important sociolinguistic variations
and morphophonemic features. The green-layer transcripts in this corpus are a subset
of the transcripts contained in Levantine Arabic QT Training Data Set 5, Transcripts,
LDC2006T07. The yellow-layer transcription was added in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in North Levantine Arabic and South Levantine Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Jordan
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Lebanon
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- Geographic subdivision:
Palestine
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Buckwalter, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert (author)
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u urd d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634123
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 513-040-223-174-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ARL Urdu Speech Database, Training Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for ARL Urdu Speech Database, Training Data, Linguistic
Data Consortium (LDC) catalog number LDC2007S03 and ISBN 1-58563-412-3. The recordings
in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S.
Army Research Laboratory (ARL) provided this corpus to the LDC for distribution. Urdu
is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal
Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and
Arabic influences, but in fact is a dialect of Hindustani. The word "Urdu" refers
to the standardized register of Hindustani, but there are many non-standard idiolects
as well. Urdu is the twentieth most spoken language in the world. It is the native
language of over 60 million people, it is the official language of Pakistan, and it
is one of India's national languages. Urdu is also spoken in Afghanistan. The ARL
Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu
speakers from Pakistan and Northern India. The distribution of speaker dialects is
as follows:
Accent               Number of Speakers
South Sindh          29
North Sindh          30
South Punjab         27
North Punjab         29
Capital Area         29
North West Regions   30
Baluchistan          26
The database is divided into two parts, a training set containing approximately 80% of
the data and a test set comprising approximately 20% of the data. This release consists
of approximately 80% of the complete dataset (training and test). *Data* Each speaker was
presented
with 400 prompts to read: sentences, place names, and person names. Two microphones
set at different distances to the speaker were used for the recordings. The recorded
speech was stored in raw format files with headers stored in separate directories.
Each utterance is transcribed in the corresponding label file for each recording.
The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were
written out in full. *Update* Earlier versions were missing the content list file.
This is now available as part of the complete download file.
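As a quick consistency check on the accent distribution given in the summary, the sketch below sums the seven regional counts and derives each region's share. The names `SPEAKERS_BY_ACCENT`, `total_speakers`, and `shares` are illustrative only and are not part of the corpus documentation.

```python
# Speaker counts per accent region, copied from the summary above.
SPEAKERS_BY_ACCENT = {
    "South Sindh": 29,
    "North Sindh": 30,
    "South Punjab": 27,
    "North Punjab": 29,
    "Capital Area": 29,
    "North West Regions": 30,
    "Baluchistan": 26,
}


def total_speakers(counts):
    """Return the total number of speakers across all accent regions."""
    return sum(counts.values())


def shares(counts):
    """Return each region's fraction of the total speaker pool."""
    total = total_speakers(counts)
    return {accent: n / total for accent, n in counts.items()}
```

The total agrees with the 200 speakers stated in the summary.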
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Urdu. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Urdu language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd, Sydney, Australia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634166
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 336-874-552-847-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Gigaword Third Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The English Gigaword Corpus is a comprehensive archive of newswire text data that
has been acquired over several years by the Linguistic Data Consortium (LDC) at the
University of Pennsylvania. This is the third edition of the English Gigaword Corpus.
This edition includes all of the contents in the previous edition (LDC2005T12) as
well as new data from the same five sources presented there covering the 24-month period
of January 2005 through December 2006. Also, a sixth data source (the Los Angeles
Times/Washington Post newswire service) has been added in this edition. The six distinct
international sources of English newswire included in this edition are the following:
Agence France-Presse, English Service (afp_eng) Associated Press Worldstream, English
Service (apw_eng) Central News Agency of Taiwan, English Service (cna_eng) Los Angeles
Times/Washington Post Newswire Service (ltw_eng) New York Times Newswire Service (nyt_eng)
Xinhua News Agency, English Service (xin_eng) The seven-letter codes in the parentheses
above include the three-character source name abbreviations and the three-character
language code ("eng") separated by an underscore ("_") character. The three-letter
language code conforms to LDC's internal convention based on the new ISO 639-3 standard.
The seven-letter codes are used in both the directory names where the data files are
found, and in the prefix that appears at the beginning of every data file name. As
with other Gigaword releases, some of the content in this corpus has been published
previously by the LDC in a variety of other, older corpora, particularly the North
American News text corpora, the various TDT corpora, and the AQUAINT text corpus,
as well as earlier editions of Gigaword English. *New in the Third Edition* * New
newswire data contents from January 2005 to December 2006 have been added for all
of the five newswire sources that were represented in the first edition. * A new source,
the Los Angeles Times/Washington Post newswire service, has been added. * A small
handful of corrections to older APW data have been made to remove a few non-English
stories, clean up some character "noise", and rectify the encoding for a few non-ASCII
characters. * The CNA content introduced in Gigaword English 2nd Edition has been
completely updated to repair data corruptions caused by occasional character encoding
problems; as a result of the update, there may be differences in the inventory and/or
ID strings of DOC elements in this portion of the corpus, relative to the previous
edition. (The nature of encoding problems is explained below under "SOURCE SPECIFIC
PROPERTIES".) * Many of the files (141 out of 722) include a small number of UTF-8
"wide" characters, typically accented letters found in proper names and borrowed words
(some sources also use special punctuation marks, non-breaking spaces, etc). Apart
from the replacement/update of all CNA files, the data content of the 2nd edition
has been included in the present release without modification.
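The seven-character naming convention described in the summary can be illustrated with a minimal Python sketch that splits a name into its source abbreviation and language code. The function name `split_source_code` and any file-name suffix are hypothetical; the summary states only that the code appears as a directory name and as the prefix of each data file name, not what follows the prefix.

```python
import re

# The six codes listed in the summary above.
KNOWN_CODES = {"afp_eng", "apw_eng", "cna_eng", "ltw_eng", "nyt_eng", "xin_eng"}

# Three characters, an underscore, three characters, at the start of the name.
_PREFIX = re.compile(r"^([a-z]{3})_([a-z]{3})")


def split_source_code(name):
    """Return (source, language) from the seven-character prefix of name.

    Raises ValueError if the prefix is not one of the six listed codes.
    """
    match = _PREFIX.match(name.lower())
    if match is None or match.group(0) not in KNOWN_CODES:
        raise ValueError(f"no known source code at start of {name!r}")
    return match.group(1), match.group(2)
```

For example, `split_source_code("ltw_eng")` yields the source `ltw` and the language `eng`.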
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634174
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 492-763-586-162-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TDT5 Multilingual Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The TDT5 corpora were created by Linguistic Data Consortium with support from the
DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program.
This release contains the complete set of English, Arabic and Chinese newswire text
used in the 2004 Topic Detection and Tracking technology evaluations. The topic relevance
annotations corresponding to this publication can be found in LDC Publication LDC2006T19,
TDT5 Topics and Annotations. Topic Detection and Tracking (TDT) refers to automatic
techniques for finding topically related material in streams of data such as newswire
and broadcast news. There were four TDT tasks defined for the 2004 evaluation: the
tracking of known topics, the detection of unknown topics, the detection of initial
stories on unknown topics, and the detection of pairs of stories on the same topic
(links). Of these four tasks, the topic tracking task and the link detection task
are considered to be "primary." Previous TDT evaluations also included a story segmentation
task. This task applied only to broadcast news. Since TDT5 does not include broadcast
news, there is no story segmentation task in the 2004 TDT Evaluation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634190
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006S46
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 537-141-493-555-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006S46
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Broadcast News Speech consists of 10 hours of speech recorded by the Linguistic
Data Consortium (LDC) from Voice of America satellite radio news broadcasts in Arabic
transmitted between June 2000 and January 2001. The corresponding transcripts are
available as Arabic Broadcast News Transcripts (LDC2006T20). This work was undertaken
in the Networking Data Centers (NetDC) project (MLIS-5017, NSF IIS-9982201) in conjunction
with the European Language Resources Association (ELRA). ELRA collected 22.5 hours
of Arabic broadcast data from Radio Orient (France) that is available in NetDC Arabic
BNSC (Broadcast News Speech Corpus) ELRA-S0157. The goal of the NetDC project was
to improve the infrastructure for language resources by designing and implementing
new modes of cooperation between LDC and ELRA. *Data* The recordings were captured
from a dedicated satellite receiver and stored as 16-bit PCM, 16-kHz, single-channel,
in NIST SPHERE format. The duration of each recording is either 60 minutes or 120
minutes, depending on the VOA broadcast schedule; the date (YYYYMMDD), start-time
and end-time (HHMM EST) for each recording are indicated in the file names. The sample
data are not compressed.
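Because the sample data are uncompressed 16-bit, 16-kHz, single-channel PCM, the size of the audio payload follows directly from the recording duration. The sketch below works through that arithmetic; it leaves out the NIST SPHERE header, whose size is not given in the summary, and the function name `pcm_bytes` is illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # 16-kHz sampling, per the summary
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel


def pcm_bytes(minutes):
    """Return the uncompressed sample-data size in bytes for a recording.

    The SPHERE header is not counted, since its size is not stated.
    """
    seconds = minutes * 60
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * seconds
```

Under these figures a 60-minute recording carries 115,200,000 bytes of sample data and a 120-minute recording twice that.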
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006S46
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2006 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634204
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2006T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 476-762-568-967-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2006]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2006T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium
(LDC) and consists of ten hours of transcribed speech from Voice of America satellite
radio news broadcasts in Arabic recorded by LDC between June 2000 and January 2001.
The corresponding speech files are available in Arabic Broadcast News Speech (LDC2006S46).
This work was undertaken in the Networking Data Centers (NetDC) project (MLIS-5017,
NSF IIS-9982201) in conjunction with the European Language Resources Association (ELRA).
ELRA transcribed 22.5 hours of Arabic broadcast data from Radio Orient (France) that
is available in NetDC Arabic BNSC (Broadcast News Speech Corpus) (ELRA-S0157). The
goal of the NetDC project was to improve the infrastructure for language resources
by designing and implementing new modes of cooperation between LDC and ELRA. *Data*
The character encoding is entirely in ASCII; Buckwalter transliteration is used for
rendering the Arabic text content. Time alignment and structural markup are rendered
via "pseudo-SGML" tags, which are presented one tag per line, with the first character
of the line being an open angle bracket. The lines of transcription text (i.e. the
speech and annotation content between the time-stamp tags) all begin with a single
space character, and present exactly one token per line. A "token" may be a spoken
Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")",
which represents an annotation of a non-speech condition or event (e.g. "music", "noise",
"laugh", etc.).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
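The line layout described in the transcript summary (tag lines starting with an open angle bracket, text lines starting with a single space, one token per line, non-speech annotations wrapped in "(%" and ")") can be sketched as a small line classifier. This is a minimal illustration under those stated rules only; the function name `classify_lines` is hypothetical, and no actual tag names from the corpus are assumed.

```python
def classify_lines(lines):
    """Classify transcript lines as tags, spoken tokens, or non-speech marks.

    Returns a list of (kind, text) tuples where kind is one of
    'tag', 'token', or 'nonspeech'. Blank lines are skipped.
    """
    items = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<"):
            # Markup: one pseudo-SGML tag per line.
            items.append(("tag", line))
        elif line.startswith(" "):
            token = line[1:]  # drop the single leading space
            if token.startswith("(%") and token.endswith(")"):
                items.append(("nonspeech", token[2:-1]))
            else:
                items.append(("token", token))
        else:
            raise ValueError(f"unexpected line start: {line!r}")
    return items
```

Testing the first character before anything else matters because the Buckwalter scheme is commonly described as using angle brackets for some hamza forms, so a token may itself contain "<" while still being preceded by the leading space.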
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2006T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634212
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 898-857-291-160-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ISI Arabic-English Automatically Extracted Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This distribution contains a corpus of Arabic-English parallel sentences, which were
extracted automatically from two monolingual corpora: Arabic Gigaword Second Edition
(LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The data was extracted
from news articles published by Xinhua News Agency and Agence France Presse and was
obtained using the automatic parallel sentence identification method described in
the following publication: Dragos Stefan Munteanu, Daniel Marcu, 2005. Improving Machine Translation
Performance by Exploiting Non-parallel Corpora, Computational Linguistics, 31(4):477-504.
The corpus contains 1,124,609 sentence pairs; the word count on the English side is
approximately 31M words. The sentences in the parallel corpus preserve the form and
encoding of the texts in the original Gigaword corpora. For each sentence pair in
the corpus the authors provide the names of the documents from which the two sentences
were extracted, as well as a confidence score (between 0.5 and 1.0), which is indicative
of their degree of parallelism. The parallel sentence identification approach is designed
to judge sentence pairs in isolation from their contexts, and can therefore find parallel
sentences within document pairs which are not parallel. The fact that two documents
share several parallel sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation (MT), the
authors made efforts to detect potential overlaps between this data and the standard
test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation
data sets contain several articles from Xinhua News Agency and Agence France Presse.
Sentence pairs in this distribution that have a 7-gram overlap with a sentence pair
in a NIST MT evaluation set or sentence pairs coming from documents whose names are
similar to those in the NIST MT sets are marked with a negative confidence score.
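The confidence-score convention described in the summary (scores between 0.5 and 1.0 for extracted pairs, negative scores for pairs that overlap the NIST MT evaluation sets) suggests a simple filter when preparing training data. The sketch below assumes the pairs have already been loaded into memory; the `SentencePair` class and both function names are hypothetical, and the on-disk format of the corpus is not described here.

```python
from dataclasses import dataclass


@dataclass
class SentencePair:
    arabic: str
    english: str
    confidence: float  # negative means flagged as overlapping NIST MT data


def usable_pairs(pairs, min_confidence=0.5):
    """Keep pairs at or above the threshold; flagged pairs are negative
    and therefore always excluded."""
    return [p for p in pairs if p.confidence >= min_confidence]


def flagged_pairs(pairs):
    """Return the pairs marked with a negative confidence score."""
    return [p for p in pairs if p.confidence < 0]
```

Raising `min_confidence` trades corpus size for a higher expected degree of parallelism.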
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Munteanu, Dragos Stefan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634220
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 224-310-954-973-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ISI Chinese-English Automatically Extracted Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for ISI Chinese-English Automatically Extracted Parallel
Text, Linguistic Data Consortium (LDC) catalog number LDC2007T09 and ISBN 1-58563-422-0.
This distribution contains a corpus of Chinese-English parallel sentences, which were
extracted automatically from two monolingual corpora: Chinese Gigaword Second Edition
(LDC2005T14) and English Gigaword Second Edition (LDC2005T12). The data was extracted
from news articles published by Xinhua News Agency and was obtained using the automatic
parallel sentence identification method described in the following publication: Dragos
Stefan Munteanu, Daniel Marcu, 2005. Improving Machine Translation Performance by
Exploiting Non-parallel Corpora, Computational Linguistics, 31(4):477-504. The corpus
contains 558,567 sentence pairs; the word count on the English side is approximately
16M words. The sentences in the parallel corpus preserve the form and encoding of
the texts in the original Gigaword corpora. For each sentence pair in the corpus the
authors provide the names of the documents from which the two sentences were extracted,
as well as a confidence score (between 0.5 and 1.0), which is indicative of their
degree of parallelism. The parallel sentence identification approach is designed to
judge sentence pairs in isolation from their contexts, and can therefore find parallel
sentences within document pairs which are not parallel. The fact that two documents
share several parallel sentences does not necessarily mean the documents are parallel.
In order to make this resource useful for research in Machine Translation (MT), the
authors made efforts to detect potential overlaps between this data and the standard
test and development data sets used by the MT community. The NIST 2002-2005 MT evaluation
data sets contain several articles from Xinhua News Agency. Sentence pairs in this
distribution that have a 7-gram overlap with a sentence pair in a NIST MT evaluation
set or sentence pairs coming from documents whose names are similar to those in the
NIST MT sets are marked with a negative confidence score.
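The 7-gram overlap test mentioned in the summary can be illustrated with a short sketch: two sentences overlap if they share any run of seven consecutive tokens. Whitespace tokenization and the function names are assumptions made for illustration; the authors' actual tokenization and matching procedure are not described in this record.

```python
def ngrams(tokens, n=7):
    """Return the set of all n-token runs in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(sentence, reference, n=7):
    """True if the two sentences share at least one n-gram.

    Sentences shorter than n tokens have no n-grams and never overlap.
    """
    shared = ngrams(sentence.split(), n) & ngrams(reference.split(), n)
    return bool(shared)
```

A corpus-wide check would build the n-gram set for the evaluation data once and test each extracted sentence against it.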
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Munteanu, Dragos Stefan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634352
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 856-715-289-369-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MITRE 1997 Mandarin Broadcast News Speech Translations (HUB-4NE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MITRE 1997 Mandarin Broadcast News Transcripts Translations (HUB-4NE) was developed
by The MITRE Corporation and contains segment-aligned English translations of the
1997 DARPA HUB4-NE Mandarin transcripts. The original transcripts and the corresponding
broadcast news audio are available as separate LDC publications, 1997 Mandarin Broadcast
News Transcripts (HUB4-NE) (LDC98T24) and 1997 Mandarin Broadcast News Speech (HUB4-NE)
(LDC98S73). The source data comprises 30 hours of recorded Mandarin broadcasts
collected by the LDC in 1997 from Voice of America, China Central TV and KAZN-AM,
a commercial radio station based in Los Angeles, CA. The original transcript segmentation
is suitable for speech recognition, but does not support machine translation and machine
translation evaluation. Therefore, the Mandarin side of these aligned transcripts
was resegmented for this release. In all other respects, the Mandarin transcripts
in this publication are identical to the original transcripts. The dataset in this
release consists of 376K words of English text and 517K characters of Mandarin text.
The English text was produced by translators with no access to the original audio.
The translators were given specific guidelines for translation, and those are included
in this distribution. A portion of the source data (6%) was translated four times
in order to support experiments in translation evaluation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
- General subdivision:
Language
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Henderson, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Justin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634379
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 456-764-238-567-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRECVID 2005 Keyframes & Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for TRECVID 2005 Keyframes & Transcripts, Linguistic
Data Consortium (LDC) catalog number LDC2007V01 and ISBN 1-58563-437-9. TREC Video
Retrieval Evaluation (TRECVID) is sponsored by the National Institute of Standards
and Technology (NIST) to promote progress in content-based retrieval from digital
video via open, metrics-based evaluation. The keyframes in this release were extracted
for use in the NIST TRECVID 2005 Evaluation. TRECVID is a laboratory-style evaluation
that attempts to model real world situations or significant component tasks involved
in such situations. In 2005 there were four main tasks with associated tests: * shot
boundary determination * low-level feature extraction * high-level feature extraction
* search (interactive, manual, and automatic) For a detailed description of the TRECVID
Evaluation Tasks, please refer to the NIST TRECVID 2005 Evaluation Description. *Data*
The source data is Arabic, Chinese and English language broadcast programming collected
in November 2004 from the following sources: Lebanese Broadcasting Corp. (Arabic);
China Central TV and New Tang Dynasty TV (Chinese); and CNN and MSNBC/NBC (English).
Shots are fundamental units of video, useful for higher-level processing. To create
the master list of shots, the video was segmented. The results of this pass are called
subshots. Because the master shot reference is designed for use in manual assessment,
a second pass over the segmentation was made to create the master shots of at least
2 seconds in length. These master shots are the ones used in submitting results for
the feature and search tasks in the evaluation. In the second pass, starting at the
beginning of each file, the subshots were aggregated, if necessary, until the current
shot was at least 2 seconds in duration, at which point the aggregation began anew
with the next subshot. The keyframes were selected by going to the middle frame of
the shot boundary, then parsing left and right of that frame to locate the nearest
I-Frame. This then became the keyframe and was extracted. Keyframes have been provided
at both the subshot (NRKF) and master shot (RKF) levels. In a small number of cases
(all of them subshots) there was no I-Frame within the subshot boundaries. When this
occurred, the middle frame was selected. (There is one anomaly: at the end of the first
video in the test collection, a subshot occurs outside a master shot.) The emphasis
in the common shot boundary reference is on the shots, not the transitions. The shots
are contiguous. There are no gaps between them. They do not overlap. The media time
format is based on the Gregorian day time (ISO 8601) norm. Fractions are defined by
counting pre-specified fractions of a second.
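
The second-pass aggregation described above can be sketched as follows. This is a minimal illustration, not the NIST tooling: the function name, the (start, end) tuples in seconds, and the choice to return any too-short trailing subshots separately are all assumptions made for the example.

```python
from typing import List, Tuple

Shot = Tuple[float, float]  # (start_seconds, end_seconds); hypothetical representation


def aggregate_subshots(
    subshots: List[Shot], min_duration: float = 2.0
) -> Tuple[List[Shot], List[Shot]]:
    """Merge contiguous subshots, in order, into master shots of at least min_duration.

    Returns (master_shots, leftover), where leftover holds trailing subshots that
    never reached min_duration. How the real reference handled such a tail is not
    stated in the record, so it is returned untouched here (assumption).
    """
    master_shots: List[Shot] = []
    pending: List[Shot] = []
    for shot in subshots:
        pending.append(shot)
        # Duration of the shot being built: end of newest subshot minus start of oldest.
        if pending[-1][1] - pending[0][0] >= min_duration:
            master_shots.append((pending[0][0], pending[-1][1]))
            pending = []  # aggregation begins anew with the next subshot
    return master_shots, pending


subshots = [(0.0, 0.8), (0.8, 2.5), (2.5, 5.0), (5.0, 5.6), (5.6, 6.1)]
masters, leftover = aggregate_subshots(subshots)
print(masters)   # master shots built from the subshots
print(leftover)  # trailing subshots shorter than the threshold
```

Returning the short tail separately keeps the sketch neutral about how a subshot that falls outside every master shot should be handled.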
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Television broadcasting of news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wilkins, Peter
ADDED ENTRY--PERSONAL NAME
- Personal name:
Petersohn, Christian
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634417
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 171-422-435-824-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2001 Topic Annotated Enron Email Data Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The 2001 Topic Annotated Enron Email Data Set contains approximately 5000 (4936) emails
from Enron Corporation (Enron) manually indexed into 32 topics. It is a subset of
the original Enron Email Data Set of 1.5 million emails that was posted on the Federal
Energy Regulatory Commission website as a matter of public record during the investigation
of Enron. The original set suffered from document integrity problems; attempts were
made to improve the quality of the data and to remove some sensitive and private information.
Dr. William Cohen of Carnegie Mellon University took the lead in distributing the
improved corpus, consisting of 517,431 Enron employee emails that covered the period
1999-2002. This corpus is a subset of the Carnegie Mellon data set and covers the
period from January 2001 to December 2001. The email topics reflect the business activities
and interests of Enron employees in that year: California energy problems and the
subsequent state and Federal investigations, Enron's downfall (newsfeeds and interoffice
communications), Enron's venture with the Dabhol India Power Company, EnronOnline
(Enron's trading infrastructure), competitors (Dynegy, El Paso Pipeline) and even
fantasy football and college football. Eliminated from this data set are duplicates,
emails that are too small and emails that are not really topics but are types (personnel
memos and personal quips). The manual indexing was performed in the summer of 2006
by two people who worked closely together: a research associate familiar with the
Enron saga and a junior in economics at the University of Tennessee. The original
Enron Email Data Set is the first large email set made available to researchers, but
until now there has been no ability to assess the performance of topic detection and
tracking algorithms with the email set. Having an annotated subset such as this one
should provide text mining researchers with a way to evaluate the accuracy of new
algorithms for clustering and classification. This data set can also be used to provide
communication context for researchers using the Enron Email Data Set in social network
analysis. Previous annotations such as the one developed at UC Berkeley have been
primarily based on email type rather than the specific topic(s) of discussion. This
annotation can be used to qualify the discussion topics between individuals and groups
comprising a social network of Enron employees. Due to the complexity of this corpus'
directory structure, it will be distributed as a compressed tar file on a CD. Most compression
utilities will uncompress the package.
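
As an illustration of how topic annotations like these could be used to score a clustering algorithm, the sketch below computes cluster purity against reference topic labels. Purity is one common measure chosen here for the example; the record does not say which metric the authors had in mind. The function name and the toy labels are hypothetical and are not taken from the data set.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], topic_labels: Sequence[Hashable]) -> float:
    """Fraction of items whose cluster's majority topic matches their own topic."""
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    # Count how often each reference topic occurs inside each predicted cluster.
    topics_per_cluster = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        topics_per_cluster[cluster][topic] += 1
    # Each cluster contributes the size of its largest single-topic group.
    majority_total = sum(max(counts.values()) for counts in topics_per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example with made-up labels (not drawn from the corpus).
predicted = [0, 0, 0, 1, 1, 2]
reference = ["california", "california", "dabhol", "dabhol", "dabhol", "football"]
print(purity(predicted, reference))
```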
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Electronic mail messages
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Berry, Michael W.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Browne, Murray
ADDED ENTRY--PERSONAL NAME
- Personal name:
Signer, Ben
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634522
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 570-571-401-317-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Distillation Training
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Distillation Training, Linguistic Data Consortium (LDC) catalog number
LDC2007T20 and ISBN 1-58563-452-2, constitutes the final release of training data
created by LDC for the DARPA GALE Program Phase 1 Distillation technology evaluation.
Distillation is one of three primary technology components for the DARPA GALE Program,
along with Transcription and Translation. Distillation engines respond to queries
from English-speaking users, delivering pertinent, consolidated information in easy-to-understand
forms. The distillation engine processes English and foreign language material, both
speech and text, from multiple sources and documents, removing redundancy and presenting
an integrated response to the user. This release consists of 248 English, Chinese
and/or Arabic queries and their responses, created by LDC annotators. Queries conform
to one of ten template types. Query responses may include document and snippet relevance
judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been annotated
for all features, while the remainder are labeled for only some features. In addition,
not all queries have been exhaustively annotated for a given feature, given resource
constraints during corpus development. The table below indicates the number of queries
that have been labeled for each template in each source language.

              English   Chinese   Arabic
Template 1    15/28     9/17      12/16
Template 3    16/29     9/29      13/29
Template 4    15/23     7/18      11/18
Template 5    21/39     10/39     20/36
Template 6    15/20     7/19      7/20
Template 8    12/14     6/13      5/14
Template 9    14/23     7/21      10/21
Template 11   11/22     8/15      2/14
Template 15   12/21     8/11      5/11
Template 16   13/24     10/12     8/12
Total         144/243   81/194    93/191

*Annotation* The annotation
task involves responding to a series of user queries. For each query, annotators first
find relevant documents and identify snippets (strings of contiguous text that answer
the query) in the Arabic, Chinese or English source document. Annotators then create
a nugget for each fact expressed in the snippet. Semantically equivalent nuggets are
grouped into cross-language, cross-document "supernugs". Judges at BAE Systems finally
provide relevance weights for each supernug. Queries in this release have been annotated
for the following tasks:
* searching for relevant documents and providing yes/no judgments
* extracting snippets
* resolving pronouns and certain types of temporal and locative expressions contained in the snippets
* creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query
* building nugs, i.e. clusters of semantically-equivalent nuggets for each language
* building supernugs, i.e. clusters of semantically-equivalent nugs across languages
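
As a quick consistency check on the per-template counts given earlier in this summary, the sketch below sums each language column and compares the result with the Total row. The record does not explain what the two numbers in each "a/b" cell stand for, so the code treats them only as a pair of integers; the variable and function names are illustrative.

```python
# Each cell is (first_number, second_number) as printed in the summary's table;
# columns are English, Chinese, Arabic.
COUNTS = {
    "Template 1": ((15, 28), (9, 17), (12, 16)),
    "Template 3": ((16, 29), (9, 29), (13, 29)),
    "Template 4": ((15, 23), (7, 18), (11, 18)),
    "Template 5": ((21, 39), (10, 39), (20, 36)),
    "Template 6": ((15, 20), (7, 19), (7, 20)),
    "Template 8": ((12, 14), (6, 13), (5, 14)),
    "Template 9": ((14, 23), (7, 21), (10, 21)),
    "Template 11": ((11, 22), (8, 15), (2, 14)),
    "Template 15": ((12, 21), (8, 11), (5, 11)),
    "Template 16": ((13, 24), (10, 12), (8, 12)),
}
TOTAL_ROW = ((144, 243), (81, 194), (93, 191))


def column_sums(rows):
    """Sum both numbers of every cell, column by column."""
    sums = []
    for col in range(3):
        first = sum(cells[col][0] for cells in rows.values())
        second = sum(cells[col][1] for cells in rows.values())
        sums.append((first, second))
    return tuple(sums)


print(column_sums(COUNTS) == TOTAL_ROW)  # True when the Total row matches the column sums
```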
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Babko-Malaya, Olga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Ramez
ADDED ENTRY--PERSONAL NAME
- Personal name:
Medero, Julie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634360
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007V02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 558-793-302-438-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRECVID 2003 Keyframes & Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007V02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The TREC Video Retrieval Evaluation (TRECVID) is sponsored by the National Institute
of Standards and Technology (NIST) to promote progress in content-based retrieval
from digital video via open, metrics-based evaluation. The keyframes in this release
were extracted for use in the NIST TRECVID 2003 Evaluation. TRECVID is a laboratory-style
evaluation that attempts to model real world situations or significant component tasks
involved in such situations. In 2003 there were four main tasks with associated tests:
* shot boundary determination * story segmentation * high-level feature extraction
* search (interactive and manual) For a detailed description of the TRECVID Evaluation
Tasks, please refer to the NIST TRECVID 2003 Evaluation Description. *Data* The source
data is English language broadcast programming collected by LDC in 1998 from ABC ("World
News Tonight") and CNN ("CNN Headline News"). Shots are fundamental units of video,
useful for higher-level processing. To create the master list of shots, the video
was segmented. The results of this pass are called subshots. Because the master shot
reference is designed for use in manual assessment, a second pass over the segmentation
was made to create the master shots of at least 2 seconds in length. These master
shots are the ones used in submitting results for the feature and search tasks in
the evaluation. In the second pass, starting at the beginning of each file, the subshots
were aggregated, if necessary, until the current shot was at least 2 seconds in duration,
at which point the aggregation began anew with the next subshot. The keyframes were
selected by going to the middle frame of the shot boundary, then parsing left and
right of that frame to locate the nearest I-Frame. This then became the keyframe and
was extracted. Keyframes have been provided at both the subshot (NRKF) and master
shot (RKF) levels. In a small number of cases (all of them subshots) there was no
I-Frame within the subshot boundaries. When this occurred, the middle frame was selected.
(There is one anomaly: at the end of the first video in the test collection, a subshot
occurs outside a master shot.) The emphasis in the common shot boundary reference
is on the shots, not the transitions. The shots are contiguous. There are no gaps
between them. They do not overlap. The media time format is based on the Gregorian
day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions
of a second.
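
The keyframe selection rule described above (start at the middle frame, look left and right for the nearest I-Frame, fall back to the middle frame if the shot contains none) can be sketched as below. This is an illustrative reading of the description, not the tool actually used; the function name, the frame-number representation, and the tie-break in favor of the earlier frame are assumptions.

```python
from typing import Set


def select_keyframe(first_frame: int, last_frame: int, i_frames: Set[int]) -> int:
    """Return the I-frame nearest the middle of [first_frame, last_frame].

    Falls back to the middle frame when no I-frame lies inside the shot.
    Ties between an earlier and a later I-frame go to the earlier one (assumption).
    """
    middle = (first_frame + last_frame) // 2
    max_offset = max(middle - first_frame, last_frame - middle)
    for offset in range(max_offset + 1):
        # Check the frame to the left first, then the frame to the right.
        for candidate in (middle - offset, middle + offset):
            if first_frame <= candidate <= last_frame and candidate in i_frames:
                return candidate
    return middle


# Hypothetical stream with an I-frame every 12 frames.
print(select_keyframe(100, 160, {96, 108, 120, 132, 144, 156}))
# Short subshot with no I-frame inside it: the middle frame is used.
print(select_keyframe(10, 14, {0, 24}))
```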
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Television broadcasting of news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Quenot, Georges
ADDED ENTRY--PERSONAL NAME
- Personal name:
Over, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007V02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634409
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 722-221-552-342-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OntoNotes Release 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Natural language applications like machine translation, question answering, and summarization
currently are forced to depend on impoverished text models like bags of words or n-grams,
while the decisions that they are making ought to be based on the meanings of those
words in context. That lack of semantics causes problems throughout the applications.
Misinterpreting the meaning of an ambiguous word results in failing to extract data,
incorrect alignments for translation, and ambiguous language models. Incorrect coreference
resolution results in missed information (because a connection is not made) or incorrectly
conflated information (due to false connections). Some richer semantic representation
is badly needed. The OntoNotes project is a collaborative effort between BBN Technologies,
the University of Colorado, the University of Pennsylvania, and the University of
Southern California's Information Sciences Institute to produce such a resource. It
aims to annotate a large corpus comprising various genres of text (news, conversational
telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English,
Chinese, and Arabic) with structural information (syntax and predicate argument structure)
and shallow semantics (word sense linked to an ontology and coreference). OntoNotes
builds on two time-tested resources, following the Penn Treebank for syntax and the
Penn PropBank for predicate-argument structure. Its semantic representation will include
word sense disambiguation for nouns and verbs, with each word sense connected to an
ontology, and coreference. The current goals call for annotation of over a million
words each of English and Chinese, and half a million words of Arabic over five years.
The authors wish to make this resource available to the natural language research
community so that decoders for these phenomena can be trained to generate the same
structure in new documents. Lessons learned over the years have shown that the quality
of annotation is crucial if it is going to be used for training machine learning algorithms.
Taking this cue, we ensure that each layer of annotation in OntoNotes will have at
least 90% inter-annotator agreement. Our pilot studies have shown that predicate
structure, word sense, ontology linking, and coreference can all be annotated rapidly
and with better than 90% consistency.
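
To make the 90% figure concrete, the sketch below computes simple percent agreement between two annotators over the same items. The record does not state how agreement was measured for OntoNotes, so plain percent agreement is an assumption made for illustration (chance-corrected measures are another common choice); the function name and sense labels are hypothetical.

```python
from typing import Hashable, Sequence


def percent_agreement(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Share of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up word-sense labels for ten tokens; the annotators differ on one item.
annotator_1 = ["s1", "s2", "s1", "s1", "s2", "s1", "s1", "s2", "s1", "s1"]
annotator_2 = ["s1", "s2", "s1", "s1", "s2", "s1", "s1", "s2", "s1", "s2"]
agreement = percent_agreement(annotator_1, annotator_2)
print(agreement >= 0.90)  # whether this toy pair meets a 90% threshold
```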
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ADDED ENTRY--PERSONAL NAME
- Personal name:
Micciulla, Linnea
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Babko-Malaya, Olga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hovy, Eduard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Belvin, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Houston, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634476
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 293-371-412-539-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2004 Spring NIST Rich Transcription (RT-04S) Development Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2004 Spring NIST Rich Transcription (RT-04S) Development Data contains the test material
(meeting speech and reference transcripts) used in the RT-04S evaluation administered
by the NIST (National Institute of Standards and Technology) Speech Group. Rich Transcription
(RT) is broadly defined as a fusion of speech-to-text technology and metadata extraction
technologies designed to provide the basis for a generation of more usable transcriptions
of human-human meeting speech. The data in this release contains portions of meeting
speech collected and/or transcribed by the International Computer Science Institute
(ICSI) at Berkeley, the Interactive Systems Laboratories (ISL) at Carnegie Mellon
University, NIST and LDC. The complete meeting speech and corresponding transcript
data sets are available from LDC's catalog as follows: ICSI Meeting Speech (LDC2004S02),
ICSI Meeting Transcripts (LDC2004T04), ISL Meeting Speech Part 1 (LDC2004S05), ISL
Meeting Transcripts Part 1 (LDC2004T10), NIST Meeting Pilot Corpus Speech (LDC2004S09)
and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13). The RT-04S development
data consists of the 80-minute test set used in the RT-02 Meeting Recognition Evaluation,
specifically, approximately 10 minutes from each of eight meetings held at ICSI,
CMU, LDC and NIST. For RT-04S, NIST re-released that data with additional distant
mics (if the data collection sites provided them). Although the development data
consists of 10-minute excerpts from the same data collection sites that are represented
in the RT-04S evaluation data set (2004 Spring NIST Rich Transcription (RT-04S) Evaluation
Data, LDC2007S12), it is not completely reflective of the evaluation test data since
it contains lapel mics in lieu of head mics for the LDC and CMU data and some different
distant mics for LDC data. For more information about the development test data, see
NIST's RT-04S Development Data Documentation. RT-04S included the following tasks
in the meeting domain:
Speech-to-Text Transcription (STT) tasks
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
* Individual head microphone
Processing time conditions:
* Unlimited time STT
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
Diarization (SPKR) task (who spoke when)
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
Input conditions:
* Speech input only
* Speech plus reference transcript input
Processing time conditions:
* Unlimited time
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
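The processing time conditions above can be illustrated with a minimal sketch. It assumes the common convention that a realtime factor is processing time divided by audio duration; the record itself does not define the term. The function names and condition labels are hypothetical and are not part of the NIST evaluation tooling.

```python
# Hypothetical labels for the processing time conditions listed in the record.
# Assumption: "N times realtime" means processing_time / audio_duration <= N.
CONDITIONS = [
    (1.0, "<=1xRT"),
    (10.0, "<=10xRT"),
    (20.0, "<=20xRT"),
]


def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def processing_time_condition(processing_seconds: float, audio_seconds: float) -> str:
    """Return the tightest condition label that the run satisfies."""
    factor = realtime_factor(processing_seconds, audio_seconds)
    for limit, label in CONDITIONS:
        if factor <= limit:
            return label
    return "unlimited"
```

Under this assumption, a system that spends 72,000 seconds on the 80-minute (4,800-second) set has a factor of 15 and would fall under the twenty-times-realtime condition.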
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sanders, Greg
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634484
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 581-401-882-415-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data contains the test material
(meeting speech and reference transcripts) used in the RT-04S evaluation administered
by the NIST (National Institute of Standards and Technology) Speech Group. Rich Transcription
(RT) is broadly defined as a fusion of speech-to-text technology and metadata extraction
technologies designed to provide the basis for a generation of more usable transcriptions
of human-human meeting speech. The data in this release consists of portions of meeting
speech collected and/or transcribed by the International Computer Science Institute
(ICSI) at Berkeley, the Interactive Systems Laboratories (ISL) at Carnegie Mellon
University, NIST and LDC. The complete meeting speech and corresponding transcript
data sets are available from LDC's catalog as follows: ICSI Meeting Speech (LDC2004S02),
ICSI Meeting Transcripts (LDC2004T04), ISL Meeting Speech Part 1 (LDC2004S05), ISL
Meeting Transcripts Part 1 (LDC2004T10), NIST Meeting Pilot Corpus Speech (LDC2004S09)
and NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13). RT-04S included
the following tasks in the meeting domain:
Speech-to-Text Transcription (STT) tasks
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
* Individual head microphone
Processing time conditions:
* Unlimited time STT
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
Diarization (SPKR) task ("who spoke when")
Microphone conditions:
* Multiple distant microphones
* Single distant microphone
Input conditions:
* Speech input only
* Speech plus reference transcript input
Processing time conditions:
* Unlimited time
* Less than or equal to twenty times realtime
* Less than or equal to ten times realtime
* Less than or equal to one times realtime
Further information about the evaluation is available on the RT-04 Spring Evaluation
Website.
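As a rough illustration of what "who spoke when" output can look like, the sketch below represents diarization results as speaker-labeled time segments and totals the speaking time per speaker. The `Segment` type, the function name, and the speaker labels are assumptions made for this example; this is not the file format used in the RT-04S evaluation.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Segment(NamedTuple):
    """One diarization segment: a speaker label and a time span in seconds."""
    speaker: str
    start: float
    end: float


def speaking_time(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum the duration of all segments attributed to each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {seg}")
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up example: two speakers alternating over five seconds.
example = [
    Segment("spk_A", 0.0, 2.5),
    Segment("spk_B", 2.5, 4.0),
    Segment("spk_A", 4.0, 5.0),
]
```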
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sanders, Greg
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634506
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T36
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 616-484-921-813-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 6.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T36
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for Chinese Treebank 6.0, Linguistic Data Consortium
(LDC) catalog number LDC2007T36 and ISBN 1-58563-450-6. The Chinese Treebank project
began at the University of Pennsylvania in 1998 and continues at Penn and the University
of Colorado. Chinese Treebank 6.0 is the latest version produced from this effort,
consisting of 780,000 words (over 1.28 million Chinese characters) that are segmented,
part-of-speech tagged and fully bracketed. The data sources include newswire from
Xinhua News Agency, articles from Sinorama Magazine, news from the website of the
Hong Kong Special Administrative Region and transcripts from various broadcast news
programs. The LDC published Chinese Treebank 1.0 in 2000; it was later corrected and
released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately
100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version
containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000
word Chinese Treebank 5.0 (LDC2005T01). For information about Chinese Treebank methodology
and guidelines, consult the attached documentation files and the Penn-CU Chinese Treebank
Project website. This release encompasses 2,036 text files, containing 28,295 sentences,
781,351 words and 1,285,149 hanzi (Chinese characters). The data is provided in two
encodings: GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets.
Details of the annotation standard can be found in the enclosed segmentation, POS-tagging
and bracketing guidelines. The data is provided in four different formats: raw text,
word segmented, word segmented and POS-tagged, and syntactically bracketed.
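The relationship between the syntactically bracketed format and the word-segmented, POS-tagged format can be sketched with a small reader for Penn Treebank-style labeled brackets. This is a minimal sketch, not the project's own tooling: the function names are hypothetical, the sample tree is invented for illustration rather than taken from the corpus, and escaped brackets or other corpus-specific conventions are not handled.

```python
def parse_bracketed(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        # An opening bracket directly after '(' means the node has no label.
        if tokens[pos] == "(":
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return parse_node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Invented sample in Penn Treebank-style bracketing.
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
```

Calling `tagged_words(parse_bracketed(sample))` yields the segmented words paired with their tags, which is the information the word-segmented and POS-tagged format carries without the phrase structure.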
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T36
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634433
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 231-652-533-779-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This release is Part 1 of the three-part GALE Phase 1 Chinese Broadcast News Parallel
Text, which, along with other corpora, was used as training data in year 1 (Phase
1) of the DARPA-funded GALE program. This corpus contains transcripts and English
translations of 23.3 hours of Chinese broadcast news programming. As indicated below,
a small number of audio files corresponding to the text in this corpus have been previously
released. *Source Data* A total of 23.3 hours of Chinese broadcast news programming
was selected from two sources, China Central TV (CCTV) (a broadcaster from Mainland
China) and Phoenix TV (a Hong Kong-based satellite TV station). The transcripts and
translations represent recordings of five different programs. A manual selection procedure
was used to choose data appropriate for the GALE program, namely, news programs focusing
on current events. Stories on topics such as sports, entertainment news, and stock
markets were excluded from the data set. The following table is a summary of the files
included in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2003 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585632678
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2003S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 124-056-444-354-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Telephone Conversations Complete Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2003]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2003S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Telephone Conversations Complete Set was produced by the Linguistic Data Consortium
(LDC), catalog number LDC2003P01 and ISBN 1-58563-267-8. The complete set of Korean
Telephone Conversations consists of the following:
* Korean Telephone Conversations Speech
* Korean Telephone Conversations Transcripts
* Korean Telephone Conversations Lexicon
The Korean telephone conversations were originally recorded as part of the
CALLFRIEND project. The CALLFRIEND Korean telephone speech was collected by the Linguistic
Data Consortium primarily in support of the Language Identification (LID) project,
sponsored by the U.S. Department of Defense. The calls were later transcribed for
use in other projects. Korean Telephone Conversations Speech consists of 100 telephone
conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the
remaining 51 are previously unpublished calls. The recorded conversations are between
native speakers of Korean and last up to 30 minutes, of which the transcribed speech
covers between 15 and 18 minutes. All speakers were aware that they were being recorded.
They were given no guidelines concerning what they should talk about. Once a caller
was recruited to participate, he/she was given a free choice of whom to call. Most
participants called family members or close friends. All calls originated in either
the United States or Canada. Korean Telephone Conversations Transcripts consists of
100 text files, totalling approximately 190K words and 25K unique words. All files
are in Korean orthography: the characters are in Hangul, encoded in the
KSC5601 (Wansung) system.
Korean Telephone Conversations Lexicon covers the tokens occurring in the Korean Telephone
Conversations Transcripts. The lexicon contains five tab-separated information fields:
* orthographic form in Hangul (head-word), encoded in the KSC-5601 (Wansung) system
* orthographic form in Yale romanization
* pronunciation
* frequency of the word in Korean Telephone Conversations Transcripts
* morphological analysis of the word
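The five tab-separated lexicon fields described above could be read along the following lines. This is a sketch under stated assumptions: the fields are taken to appear in the order listed, the frequency is taken to be an integer, and Python's `euc_kr` codec is assumed to be a suitable decoder for KSC-5601 (Wansung) text. The `LexiconEntry` type, the function name, and the sample line are made up for illustration and do not come from the corpus.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    hangul: str          # orthographic form in Hangul (head-word)
    yale: str            # orthographic form in Yale romanization
    pronunciation: str   # pronunciation field, kept as an opaque string
    frequency: int       # frequency in the transcripts (assumed integer)
    morphology: str      # morphological analysis, kept as an opaque string


def parse_lexicon_line(raw: bytes, encoding: str = "euc_kr") -> LexiconEntry:
    """Decode one raw lexicon line and split it into its five fields."""
    line = raw.decode(encoding).rstrip("\r\n")
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    hangul, yale, pronunciation, frequency, morphology = fields
    return LexiconEntry(hangul, yale, pronunciation, int(frequency), morphology)


# Invented sample line; the pronunciation and morphology values are placeholders.
sample_line = "학교\thakkyo\thak.kyo\t12\t학교/N\n".encode("euc_kr")
```

Keeping the pronunciation and morphology fields as plain strings avoids guessing at notation that the record does not describe.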
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Han, Na-Rae
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ko, Eon-Suk
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martey, Nii
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kim, Myeonchul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2003S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634557
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T38
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 222-703-436-942-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Gigaword Third Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T38
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Gigaword Third Edition is a comprehensive archive of newswire text data that
has been acquired over several years by the LDC. This edition includes all of the
contents in Chinese Gigaword Second Edition (LDC2005T14) as well as new data collected
after the publication of that edition. Also, an archive of articles from a new newswire
source (Agence France Presse) has been added in the third edition. The four distinct
international sources of Chinese newswire included in this edition are the following:
* Agence France Presse (afp_cmn) * Central News Agency, Taiwan (cna_cmn) * Xinhua
News Agency (xin_cmn) * Zaobao Newspaper (zbn_cmn) The seven-letter codes in the parentheses
above are used for the directory names and data files for each source, and are also
used (in ALL_CAPS) as part of the unique DOC "id" string assigned to each news article.
*Data* The original data archives received by the LDC from Agence France Presse, Xinhua
News Agency and Zaobao were encoded in GB-2312, whereas those from Central News Agency
(CNA) were encoded in Big-5. To avoid the problems and confusion that could result
from differences in character-set specifications, all text files in this corpus have
been converted to UTF-8 character encoding. *New in the Third Edition*
* Over six years' worth of articles (October 2000 through December 2006) from Agence
France Presse are being released for the first time.
* Two years' worth of new articles (January 2005 through December 2006) have been
added to the Xinhua data set.
* Nearly two years' worth of content was added to the CNA data set. There was a gap
in the LDC's collection from this source during 2006: no CNA Chinese content was
collected between July 27 and December 17, 2006, inclusive, so there are no data files
for August through November of that year, and the December data file is about half
its expected size.
* A small set of older stories (October through December 1998) has been added from
Zaobao; these were previously published by LDC as part of TDT3 Multilanguage Text
Version 2.0 (LDC2001T58) and are being included in Gigaword for the first time.
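The character-set conversion described in the *Data* section can be sketched as follows. The mapping of source codes to original encodings follows the summary (GB-2312 for Agence France Presse, Xinhua and Zaobao; Big-5 for CNA), but the function names are hypothetical and this is not the procedure LDC actually ran. The full DOC "id" format is not given in the record, so only the upper-cased source code is derived here.

```python
# Original encodings per source, as stated in the record summary.
SOURCE_ENCODINGS = {
    "afp_cmn": "gb2312",  # Agence France Presse
    "cna_cmn": "big5",    # Central News Agency, Taiwan
    "xin_cmn": "gb2312",  # Xinhua News Agency
    "zbn_cmn": "gb2312",  # Zaobao Newspaper
}


def to_utf8(raw: bytes, source: str) -> bytes:
    """Re-encode raw text from a source's original character set to UTF-8."""
    encoding = SOURCE_ENCODINGS[source]
    return raw.decode(encoding).encode("utf-8")


def doc_id_source_part(source: str) -> str:
    """Return the ALL_CAPS form of a source code, as used inside DOC ids."""
    if source not in SOURCE_ENCODINGS:
        raise KeyError(f"unknown source code: {source}")
    return source.upper()
```

Decoding with the source-specific codec before encoding to UTF-8 is what prevents the same byte sequence from being read as different characters under GB-2312 and Big-5.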
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T38
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007S18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 965-489-670-052-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Kids' Speech Version 1.1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007S18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: Kids' Speech Version 1.1, Linguistic Data Consortium (LDC) catalog number LDC2007S18
and ISBN 1-58563-395-X, is a collection of spontaneous and prompted speech from 1100
children between Kindergarten and Grade 10 in the Forest Grove School District in
Oregon. All children -- approximately 100 children at each grade level -- read approximately
60 items from a total list of 319 phonetically-balanced but simple words, sentences
or digit strings. Each utterance of spontaneous speech begins with a recitation of
the alphabet and contains a monologue of about one minute in duration. This release
consists of 1017 files containing approximately 8-10 minutes of speech per speaker.
Corresponding word-level transcriptions are also included. This corpus was developed
to facilitate research about the characteristics of children's speech at different
ages and to train and evaluate recognizers for use in language training and other
interactive tasks involving children, including recognizers used in language
development with deaf children. *Data* Data collection was performed using the CSLU
Speech Toolkit and two computers running Windows NT 4.0. Each computer was manned
by a CSLU staff member who monitored progress and helped the child with any difficulties.
The average time at the computer was 20 minutes, yielding approximately 8-10 minutes
of speech digitized at 16 bits and 16 kHz using Soundblaster 16 PnP audio cards with
head-mounted microphones. The prompted speech, consisting of 200 isolated words and
10 numeric strings, was presented as text appearing below an animated character that
produced accurate visible speech synchronized with recorded prompts. A text prompt
was also displayed. The child then reproduced the prompted word. Once the prompted
speech collection was completed, the experimenter then asked the subject a series
of questions designed to elicit spontaneous speech (e.g., "Tell me about your favorite
movie").
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shobaki, Khaldoun
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hosom, John-Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007S18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2007 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634603
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2007T40
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 769-869-926-619-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Gigaword Third Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2007]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2007T40
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Gigaword Third Edition is a comprehensive archive of newswire text data acquired
from Arabic news sources by the LDC at the University of Pennsylvania. Arabic Gigaword
Third Edition includes all of the content of Arabic Gigaword Second Edition (LDC2006T02)
as well as new data collected after the publication of that edition. Also, an archive
from a new newswire source -- Assabah -- has been included in the third edition. The
six distinct sources of Arabic newswire represented in the third edition are:
* Agence France Presse (afp_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)
The seven-character codes in the parentheses above consist of the three-character
source name IDs and the three-character language code ("arb") separated by an
underscore ("_") character. The epochs and document counts for the data in the third
edition are set forth below:
Newly Added Data
Source                  Date Span          Document Count
Agence France Presse    2005.01 - 2006.12  137815
Assabah News Agency     2004.09 - 2006.12  15410 (new source)
Al Hayat News Agency    2005.01 - 2006.1   8799 (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12  104950 (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12  135472
*Data* This release contains 547 files, totalling approximately 1.8GB in compressed
form (6,673 MB uncompressed), 576,799 K-words and 1,994,735 documents. The table below
shows data quantity by source under the following categories: data source (Source);
the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed
file size (Totl-MB); the number of space-separated word tokens in the text, in
thousands (K-words); and the number of documents per source (#DOCs).
Data Sources and Quantities
Source   #Files  Gzip-MB  Totl-MB  K-words    #DOCs
afp_arb     152      441     1806   147612   798436
asb_arb      28       23       77     6587    15410
hyt_arb     142      559     1932   171502   378353
nhr_arb     134      612     2172   193732   449340
umh_arb      24        4       14     1201     4645
xin_arb      67      171      672    56165   348551
TOTAL       547     1810     6673   576799  1994735
All text files in this corpus have been converted to UTF-8 character encoding. Certain
data and formatting issues observed in previous releases of Arabic Gigaword have been
normalized in the third edition:
* Approximately 15,000 stories from older AFP files (1994 - 2002) contained very brief
  documents where the text content was not recognized as such; in those cases, the TEXT
  element appeared empty while the HEADLINE element contained anywhere from three to
  several lines of text. The content of these documents has been rearranged. The first
  line remains as the headline and the rest of the lines have been moved into the text
  segment. All stories of this sort had been originally classified as "other", and that
  classification has not been changed in this edition.
* Al Hayat data from 2002 and 2003 contained some Arabic-Indic digits, despite the
  intention to convert all digit strings to the ASCII digit characters for consistency.
  The digits have now been converted to the ASCII range. For more details about the
  encoding challenges presented by this data, see the readme file accompanying this
  corpus.
* Some Al Hayat data had stray angle-bracket characters ("<" and ">"), which have been
  rendered as "&lt;" and "&gt;". There were also some defective "Doc-ID" strings (the
  'id' attribute in the "<DOC>" tag that begins each news story) in the January 2001
  data.
* Some An Nahar data had "bare" ampersand characters ("&") which have been rendered
  as "&amp;".
* Some Xinhua documents included empty sub-elements (HEADLINE, DATELINE and/or TEXT
  sections containing no data); when HEADLINE or DATELINE were empty, these tags were
  removed. When the TEXT segment was empty, the document as a whole was removed.
* In several Xinhua stories, the Doc-ID string, which is supposed to provide the year,
  month, date and sequence number for the story, had become garbled, yielding an
  incorrect or impossible date string. A separate data file in the "docs" directory,
  called "docid_changes.txt", lists the changes in document inventory and Doc-ID
  strings.
* Xinhua stories typically end with a formulaic Arabic string (meaning "end-of-story"),
  which should not have been included as part of the final paragraph in each story.
* In general, consistent line-wrapping was applied to make the overall text
  presentation consistent across all sources and with Gigaword releases in other
  languages. The markup pattern was also applied consistently for all sources without
  exception.
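The source-code convention and the quantity table in the summary above can be checked with a short sketch. This is only an illustration: the function and variable names (parse_source_code, QUANTITIES, column_totals) are hypothetical and are not part of the corpus or its tools; the numbers are copied from the table in this record.

```python
from typing import NamedTuple


class SourceCode(NamedTuple):
    source_id: str  # three-character source name ID, e.g. "afp"
    language: str   # three-character language code, e.g. "arb"


def parse_source_code(code: str) -> SourceCode:
    """Split a seven-character code such as 'afp_arb' at the underscore."""
    source_id, sep, language = code.partition("_")
    if sep != "_" or len(source_id) != 3 or len(language) != 3:
        raise ValueError(f"expected a code like 'afp_arb', got {code!r}")
    return SourceCode(source_id, language)


# Per-source rows copied from the quantity table in this record:
# (#Files, Gzip-MB, Totl-MB, K-words, #DOCs)
QUANTITIES = {
    "afp_arb": (152, 441, 1806, 147612, 798436),
    "asb_arb": (28, 23, 77, 6587, 15410),
    "hyt_arb": (142, 559, 1932, 171502, 378353),
    "nhr_arb": (134, 612, 2172, 193732, 449340),
    "umh_arb": (24, 4, 14, 1201, 4645),
    "xin_arb": (67, 171, 672, 56165, 348551),
}


def column_totals(rows: dict) -> tuple:
    """Sum each column across all sources."""
    return tuple(sum(column) for column in zip(*rows.values()))
```

Summing the columns this way reproduces the TOTAL row of the table, which is a quick consistency check on the per-source figures.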
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
History, Modern
- Form subdivision:
Databases.
- Chronological subdivision:
1989-
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2007T40
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u hun d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634611
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 694-868-944-045-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
hun
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hun
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hungarian-English Parallel Text, Version 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus")
is a sentence-aligned Hungarian-English parallel corpus consisting of approximately
two million sentence pairs. The corpus contains additional language resources for
the Hungarian text, including a monolingual corpus, morphological toolset and aligner.
Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research
and Education Center at the Budapest University of Technology and Economics (BUTE)
and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute
of Linguistics. Additional information about this release is available from the corpus
website maintained by BUTE. *File formats, character encoding* This publication is
issued on CD as a tarred zip file. Commonly available utilities such as Gnu Zip or
Stuffit will readily extract this publication from its compressed form. Sentence pair
(.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not
contain segments where deletion or contraction occurred. They are also filtered based
on quality, so the full reconstruction of the raw texts is impossible. Some .bi files
were shuffled (sorted alphabetically). Alignment "ladder" (.lad) files preserve the
whole of both input texts with ordering, even those segments that were not successfully
aligned. In .lad files, every line is tab-separated into two columns. The first is
a segment of the Hungarian text. The second is a (supposedly corresponding) segment
of the English text. Such segments of the source or target text will generally consist
of exactly one sentence on both sides, but can also consist of zero, or more than
one, sentence. In the latter case, the special separating token " ~~~ " is placed
between sentences. The encoding of the sentence pair and the alignment files is mixed:
ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming
majority of the texts use compatible subsets of these two encodings, so for viewing,
the files can be considered ISO Latin-2 encoded. hu and en are the raw texts used,
in ISO Latin-2 and ISO Latin-1 encoding respectively.
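The .lad layout described above can be read with a few lines of code. The following is a minimal sketch, assuming exactly two tab-separated columns per line and treating the whole line as ISO Latin-2 (the summary notes this is adequate for viewing); the function names are hypothetical and are not part of the corpus tools.

```python
SENTENCE_SEPARATOR = " ~~~ "  # token placed between sentences within one segment


def split_segment(segment: str) -> list:
    """Return the sentences in one segment; an empty segment yields []."""
    if not segment.strip():
        return []
    return [sentence.strip() for sentence in segment.split(SENTENCE_SEPARATOR)]


def parse_lad_line(raw: bytes) -> tuple:
    """Decode one .lad line and return (hungarian_sentences, english_sentences)."""
    # Assumption: decoding the full line as ISO Latin-2 is acceptable.
    text = raw.decode("iso-8859-2").rstrip("\r\n")
    columns = text.split("\t")
    if len(columns) != 2:
        raise ValueError(f"expected 2 tab-separated columns, got {len(columns)}")
    hungarian, english = columns
    return split_segment(hungarian), split_segment(english)
```

A segment with zero sentences comes back as an empty list, so unaligned material on either side stays visible to the caller instead of being dropped.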
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Hungarian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Varga, Dániel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Németh, László
ADDED ENTRY--PERSONAL NAME
- Personal name:
Halácsy, Péter
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kornai, András
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 461-663-437-911-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Arabic Blog Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains the documentation for GALE Phase 1 Arabic Blog Parallel Text, Linguistic
Data Consortium (LDC) catalog number LDC2008T02, ISBN 1-58563-462-X. Blogs are posts
to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog
Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic
blog text and its English translation from thirty-three sources. This release was
used as training data in Phase 1 of the DARPA-funded GALE program. LDC has released
the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data* The task of preparing
this corpus involved four stages of work: data scouting, data harvesting, formatting,
and data selection. Data scouting involved manually searching the web for suitable
blog text. Data scouts were assigned particular topics and genres along with a production
target in order to focus their web search. Formal annotation guidelines and a customized
annotation toolkit helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest (sites, threads
and posts) to a database. A nightly process queried the annotation database and harvested
all designated URLs. Whenever possible, the entire site was downloaded, not just the
individual thread or post located by the data scout. Once the text was downloaded,
its format was standardized (by running various scripts) so that the data could be
more easily integrated into downstream annotation processes. Original-format versions
of each document were also preserved. Typically a new script was required for each
new domain name that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems. The selected documents were then reviewed
for content suitability using a semi-automatic process. A statistical approach was
used to rank a document's relevance to a set of already-selected documents labeled
as good. An annotator then reviewed the list of relevance-ranked documents and selected
those which were suitable for a particular annotation task or for annotation in general.
Those newly-judged documents in turn provided additional input for the generation
of new ranked lists. Manual sentence units/segments (SU) annotation was also performed
on a subset of files following LDC's Quick Rich Transcription specification. Three
types of end of sentence SU are identified:
- statement SU
- question SU
- incomplete SU
*Translation* After files were selected, they were reformatted into a human-readable
translation format, and the files were then assigned to professional translators for
careful translation. Translators followed LDC's GALE Translation guidelines, which
describe the makeup of the translation team, the source data format, the translation
data format, best practices for translating certain linguistic features (such as names
and speech disfluencies), and quality control procedures applied to completed translations.
Translators were instructed to return a 50-sentence sample as soon as it was completed.
The sample was reviewed by LDC's bilingual language specialists. Subsequent deliveries
were subject to quality controls as described in the translation guidelines. Low quality
translations were returned to the translators for revision.
*TDF Format* All final data are in Tab Delimited Format (TDF). TDF is compatible with
other transcription formats, such as the Transcriber format and AG format, and it is
easy to process. Each line of a TDF file corresponds to a speech segment and contains
13 tab delimited fields:
   field           data_type
   -----           ---------
 1 file            unicode
 2 channel         int
 3 start           float
 4 end             float
 5 speaker         unicode
 6 speakerType     unicode
 7 speakerDialect  unicode
 8 transcript      unicode
 9 section         int
10 turn            int
11 segment         int
12 sectionType     unicode
13 suType          unicode
A source TDF file and its translation are the same except that the transcript in the
source TDF is replaced by its English translation.
*Encoding* All data are encoded in UTF-8.
*Sponsorship* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred. *Samples* For an example of the data in this
corpus, see the screen captures (jpg) of the source text and of its translation.
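The 13-field TDF layout listed in the summary can be read with a small parser. The following is a minimal sketch, assuming the input is a single data line with every field present; how real TDF files mark header or comment lines is not stated in this record and is not handled here. The names TDF_FIELDS and parse_tdf_line, and the sample values in the usage note, are hypothetical.

```python
# Field names and types in the order given in the summary
# ("unicode" is mapped to str).
TDF_FIELDS = (
    ("file", str),
    ("channel", int),
    ("start", float),
    ("end", float),
    ("speaker", str),
    ("speakerType", str),
    ("speakerDialect", str),
    ("transcript", str),
    ("section", int),
    ("turn", int),
    ("segment", int),
    ("sectionType", str),
    ("suType", str),
)


def parse_tdf_line(line: str) -> dict:
    """Convert one tab-delimited TDF data line into a typed dictionary."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(TDF_FIELDS, values)}
```

For example, a made-up line such as "post1\t0\t0.0\t0.0\tauthor\tunknown\tnative\tHello world.\t0\t0\t1\treport\tstatement" yields a record whose "transcript" entry is "Hello world.". Because a source file and its translation differ only in the transcript field, records from the two files can be paired on the remaining fields.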
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Blogs
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634581
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 472-226-418-389-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2005 English SpatialML Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The ACE (Automatic Content Extraction) program focuses on developing automatic content
extraction technology to support automatic processing of human language in text form.
The kind of information recognized and extracted from text includes entities, values,
temporal expressions, relations and events. SpatialML is a mark-up language for representing
spatial expressions in natural language documents. SpatialML's focus is primarily
on geography and culturally-relevant landmarks, rather than biology, cosmology, geology,
or other regions of the spatial language domain. The goal is to allow for potentially
better integration of text collections with resources such as databases that provide
spatial information about a domain, including gazetteers, physical feature databases
and mapping services. In ACE 2005 English SpatialML Annotations, the authors applied
SpatialML tags to the English training data (originally annotated for entities, relations
and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE
training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23)
to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English
SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06).
The SpatialML annotation scheme is intended to emulate earlier progress on time expressions
such as TIMEX2, TimeML and the 2005 ACE guidelines. The main SpatialML tag is the
PLACE tag. The central goal of SpatialML is to map PLACE information in text to data
from gazetteers and other databases to the extent possible. Therefore, semantic attributes
such as country abbreviations, country subdivision and dependent area abbreviations
(e.g., US states), and geo-coordinates are used to help establish such a mapping.
LINK and PATH tags express relations between places, such as inclusion relations and
trajectories of various kinds. Information in the tag along with the tagged location
string should be sufficient to uniquely determine the mapping, when such a mapping
is possible. This also means that redundant information is not included in the tag.
To the extent possible, SpatialML leverages ISO and other standards towards the goal
of making the scheme compatible with existing and future corpora. The SpatialML guidelines
are compatible with existing guidelines for spatial annotation and existing corpora
within the ACE research program. In particular, the English Annotation Guidelines
for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location,
and Facility entity tags, and the Physical relation tags, all of which are mapped
to SpatialML tags. Ideas were also borrowed from Toponym Resolution Markup Language
of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme
in Garbin and Mani (2005). Information recorded in the annotation is compatible with
the feature types in the Alexandria Digital Library. This corpus also leverages the
integrated gazetteer database (IGDB) of Mardis and Burger (2005). Last but not least,
this annotation scheme can be related to the Geography Markup Language (GML) defined
by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup
Language (KML), to express geographical features. SpatialML goes beyond these schemes,
however, in terms of providing a richer markup for natural language that includes
semantic features and relationships that allow mapping to existing resources such
as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration
with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible
to use spatial reasoning not only for integration with applications, but for better
information extraction, e.g., for disambiguating a place name based on the locations
of other place names in the document. SpatialML goes to some length to represent topological
relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn
et al. 1997). Additional information about SpatialML is contained in the paper "SpatialML:
Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included
in the online documentation for this corpus. Please direct all questions about this
corpus to Janet Hitzeman (hitz@mitre.org).
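As a rough illustration of how PLACE-tagged text might be consumed, the sketch below extracts each PLACE element and its attributes from an inline-annotated snippet. Everything in the sample is an assumption made for illustration: the TEXT wrapper element, the attribute names (id, country, state) and their values are hypothetical and should be checked against the SpatialML guidelines shipped with the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline annotation; attribute names are illustrative only.
SAMPLE = (
    '<TEXT>The convoy left <PLACE id="p1" country="US" state="OR">Portland</PLACE> '
    'for <PLACE id="p2" country="FR">Paris</PLACE>.</TEXT>'
)


def extract_places(xml_text: str) -> list:
    """Return one dict per PLACE element: its text plus its attributes."""
    root = ET.fromstring(xml_text)
    return [
        {"text": "".join(element.itertext()), **element.attrib}
        for element in root.iter("PLACE")
    ]
```

Each returned dictionary carries the tagged string together with whatever semantic attributes the annotation supplies, which is the information a gazetteer lookup would start from.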
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication).
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mani, Inderjeet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hitzeman, Janet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Richer, Justin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harris, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634638
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 614-115-041-059-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Portland Cellular Telephone Speech Version 1.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: Portland Cellular Telephone Speech Version 1.3 was created by the Center for
Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon
Health and Science University, Beaverton, Oregon. It consists of cellular telephone
speech and corresponding transcripts, specifically, 7,571 utterances from 515 speakers
who made calls in the Portland, Oregon area using cellular telephones. Speakers called
the CSLU data collection system on cellular telephones, and they were asked to repeat
certain phrases and to respond to other prompts. Two prompt protocols were used: an
In Vehicle Protocol for speakers calling from inside a vehicle and a Not in Vehicle
Protocol for those calling from outside a vehicle. The protocols shared several questions,
but each protocol contained distinct queries designed to probe the conditions of the
caller's in vehicle/not in vehicle surroundings. Not every caller provided a response
to each prompt. *Recording Details* The speech data was captured digitally from CSLU's
T1 connection and saved as 8 kHz, 16-bit linear. *Transcriptions* The text transcriptions
in this corpus were produced using the non time-aligned word-level conventions described
in The CSLU Labeling Guide, which is included in the documentation for this release.
CSLU: Portland Cellular Telephone Speech Version 1.3 contains orthographic and phonetic
transcriptions of corresponding speech files. Non time-aligned orthographic transcriptions
provide quick access to the content of an utterance; they may contain markers for
word boundaries to support access and retrieval at the lexical level. Phonetic/phonemic
transcriptions represent the phonetic content of an utterance at a given level of
detail that is made explicit by the use of diacritics. Phonetic phenomena transcribed
include excessive nasalization, glottalization, frication on a stop, centralization,
lateralization, rounding and palatalization.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fanty, Mark
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634670
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 571-537-588-741-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: National Cellular Telephone Speech Release 2.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for CSLU: National Cellular Telephone Speech Release
2.3, Linguistic Data Consortium (LDC) catalog number LDC2008S02 and ISBN 1-58563-467-0.
CSLU: National Cellular Telephone Speech Release 2.3 was created by the Center for
Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon
Health and Science University, Beaverton, Oregon. It consists of cellular telephone
speech and corresponding transcripts, specifically, approximately one minute of speech
from 2336 speakers calling from locations throughout the United States. The data collection
protocol used for this release is the same protocol used in CSLU: Portland Cellular
Telephone Speech Version 1.3 (LDC2008S01). Speakers called the CSLU data collection
system on cellular telephones, and they were asked a series of questions. Two prompt
protocols were used: an In Vehicle Protocol for speakers calling from inside a vehicle
and a Not in Vehicle Protocol for those calling from outside a vehicle. The protocols
shared several questions, but each protocol contained distinct queries designed to
probe the conditions of the caller's in vehicle/not in vehicle surroundings. *Recording
Details* The data were collected with the CSLU T1 digital data collection system.
The sampling rate was 8 kHz, and the files were stored in 8-bit mu-law format on a
UNIX file system. In this release, the files are provided in 16-bit linearly encoded
Windows wav (riff) format. *Transcription* The text transcriptions in this corpus
were produced using the non time-aligned word-level conventions described in The CSLU
Labeling Guide, which is included in the documentation for this release. CSLU: National
Cellular Telephone Speech Release 2.3 contains orthographic and phonetic transcriptions
of corresponding speech files. Non time-aligned orthographic transcriptions provide
quick access to the content of an utterance; they may contain markers for word boundaries
to support access and retrieval at the lexical level. Phonetic/phonemic transcriptions
represent the phonetic content of an utterance at a given level of detail that is
made explicit by the use of diacritics. Phonetic phenomena transcribed include excessive
nasalization, glottalization, frication on a stop, centralization, lateralization,
rounding and palatalization.
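The conversion from 8-bit mu-law to 16-bit linear samples mentioned under Recording Details can be sketched as follows. This is a minimal illustration of the standard G.711 mu-law expansion, not the tool CSLU or LDC actually used (the record does not name one); the function names are hypothetical.

```python
import struct

BIAS = 0x84  # 132, the bias used in standard G.711 mu-law companding


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                        # mu-law codes are stored bit-inverted
    magnitude = ((u & 0x0F) << 3) + BIAS    # 4-bit mantissa plus bias
    magnitude <<= (u & 0x70) >> 4           # shift by the 3-bit segment number
    return BIAS - magnitude if u & 0x80 else magnitude - BIAS


def mulaw_to_pcm16_bytes(data: bytes) -> bytes:
    """Decode mu-law bytes to little-endian 16-bit PCM (the layout of WAV sample data)."""
    samples = [mulaw_byte_to_pcm16(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

Each input byte becomes two output bytes, so a decoded file is twice the size of its mu-law source, not counting any header.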
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Durham, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634654
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 563-087-655-210-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OntoNotes Release 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The OntoNotes project is a collaborative effort between BBN Technologies, the University
of Colorado, the University of Pennsylvania, and the University of Southern California's
Information Sciences Institute. The goal of the project is to annotate a large corpus
comprising various genres of text (news, conversational telephone speech, weblogs,
Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic)
with structural information (syntax and predicate argument structure) and shallow
semantics (word sense linked to an ontology and coreference). OntoNotes Release 2.0
is a continuation of the OntoNotes project and is supported by the Defense Advanced
Research Projects Agency, GALE Program Contract No. HR0011-06-C-0022. OntoNotes Release
1.0 (LDC2007T21) contains 400k words of Chinese newswire data (from Xinhua News Agency
and Sinorama Magazine) and 300k words of English newswire data (from the Wall Street
Journal). OntoNotes Release 2.0 adds the following to the corpus: 274k words of Chinese
broadcast news data (from China Broadcasting System, China Central TV, China National
Radio, China Television System and Voice of America); and 200k words of English broadcast
news data (from ABC, CNN, NBC, Public Radio International and Voice of America). Natural
language applications like machine translation, question answering, and summarization
currently are forced to depend on impoverished text models like bags of words or n-grams,
while the decisions that they are making ought to be based on the meanings of those
words in context. That lack of semantics causes problems throughout the applications.
Misinterpreting the meaning of an ambiguous word results in failing to extract data,
incorrect alignments for translation, and ambiguous language models. Incorrect coreference
resolution results in missed information (because a connection is not made) or incorrectly
conflated information (due to false connections). OntoNotes builds on two time-tested
resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument
structure. Its semantic representation will include word sense disambiguation for
nouns and verbs, with each word sense connected to an ontology, and coreference. The
current goals call for annotation of over a million words each of English and Chinese,
and half a million words of Arabic over five years. The authors wish to make this
resource available to the natural language research community so that decoders for
these phenomena can be trained to generate the same structure in new documents. Lessons
learned over the years have shown that the quality of annotation is crucial if it
is going to be used for training machine learning algorithms. Taking this cue, each
layer of annotation in OntoNotes will have at least 90% inter-annotator agreement.
Pilot studies have shown that predicate structure, word sense, ontology linking, and
coreference can all be annotated rapidly and with better than 90% consistency.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Greenberg, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hovy, Eduard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Belvin, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Houston, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634662
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 488-589-036-315-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Penn Discourse Treebank Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania.
The goal of the project is to annotate the 1 million word Wall Street Journal corpus
in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities
and propositions mentioned in text, which serve as the arguments to the relation.
Discourse relations are assumed to have exactly two arguments. PDTB, version 2.0,
is a continuation of PDTB, version 1.0 (made available freely in 2006 but no longer
available). Following a lexically grounded approach to annotation, the PDTB annotates
relations realized explicitly by Explicit connectives drawn from syntactically well-defined
classes, as well as relations between adjacent sentences when no Explicit connective
appears to relate the two. Arguments of relations are annotated in each case. For
Explicit connectives, arguments are unconstrained in terms of their distance from
the connective and can be found anywhere in the text. Between adjacent sentences where
no Explicit connective appears, four scenarios hold: (a) the sentences may be related
by a discourse relation that has no realization in the second sentence, in which case
a connective (called an Implicit connective) is provided to express the inferred relation
(b) the sentences may be related by a discourse relation that is realized by some
alternative non-connective expression, in which case these alternative lexicalizations
are annotated as the carriers of the relation (labelled as AltLex) (c) the sentences
may be related not by a discourse relation, but merely by an entity-based coherence
relation, in which case the presence of such a relation is labelled (as EntRel) and
(d) the sentences may not be related at all, in which case they are labelled as such
(NoRel). In addition to the argument structure of relations, the PDTB provides (a)
sense annotations for each discourse relation while also capturing the polysemy of
connectives, and (b) attribution annotations of relations and each of their arguments,
with each instance of attribution providing the corresponding text span along with
four features to capture the semantic contribution of the attribution. Both sense
and attribution annotations are provided for Explicit, Implicit, and AltLex relations,
but not for EntRel and NoRel. The lexically grounded approach in the PDTB exposes
a clearly defined level of discourse structure which will support the extraction of
a range of inferences associated with discourse connectives. To date, the PDTB group
has carried out various experiments on the corpus, particularly examining the following
issues: alignment between syntax and discourse, particularly with regard to attribution;
sense disambiguation of discourse connectives; and complexity of dependencies in discourse.
The annotations in Penn Discourse Treebank Version 2.0 are linked to the Penn Treebank.
The PDTB group will continue to explore these issues and to focus on more extended
projects such as discourse parsing, automatic summarization, and natural language
generation. Further work will also explore foundational issues in discourse. PDTB,
version 2.0, annotates 40600 discourse relations, distributed into the following five
types: * 18459 Explicit Relations * 16053 Implicit Relations * 624 Alternative Lexicalizations
* 5210 Entity Relations * 254 No Relations
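As a quick consistency check on the figures above, the five relation counts can be tallied in a few lines. This is only an arithmetic sketch; the dictionary keys are shorthand labels chosen here, and the sense-annotated subtotal is derived from the statement that sense and attribution annotations cover Explicit, Implicit, and AltLex relations only.

```python
# Relation counts as stated in the summary (labels are shorthand chosen for this sketch).
relation_counts = {
    "Explicit": 18459,
    "Implicit": 16053,
    "AltLex": 624,
    "EntRel": 5210,
    "NoRel": 254,
}

# Total number of annotated relations across all five types.
total = sum(relation_counts.values())

# Relations that carry sense and attribution annotations (derived, not stated in the record).
sense_annotated = sum(
    relation_counts[name] for name in ("Explicit", "Implicit", "AltLex")
)

# Share of each type as a percentage of the total.
shares = {name: 100 * count / total for name, count in relation_counts.items()}
```

The total matches the stated 40600 relations, and the derived subtotal is 35136.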
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Discourse analysis
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Prasad, Rashmi
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dinesh, Nikhil
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miltsakaki, Eleni
ADDED ENTRY--PERSONAL NAME
- Personal name:
Campion, Geraud
ADDED ENTRY--PERSONAL NAME
- Personal name:
Joshi, Aravind
ADDED ENTRY--PERSONAL NAME
- Personal name:
Webber, Bonnie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 493-213-123-848-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
STC-TIMIT 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
STC-TIMIT 1.0 is a telephone version of TIMIT Acoustic Phonetic Continuous Speech
Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of 630 speakers of eight
major dialects of American English reading ten phonetically rich sentences. Created
in 1993, TIMIT was designed to provide speech data for acoustic-phonetic studies and
for the development and evaluation of automatic speech recognition systems. Since
that time, several corpora have been developed using the TIMIT database: NTIMIT, LDC93S2
(transmitting TIMIT recordings through a telephone handset and over various channels
in the NYNEX telephone network and redigitizing them); CTIMIT, LDC96S30 (passing TIMIT
files through cellular telephone circuits); FFMTIMIT, LDC96S32 (re-recording TIMIT
files with a free-field microphone); and HTIMIT, LDC98S67 (re-recording a subset of
TIMIT files through different telephone handsets). What differentiates STC-TIMIT 1.0
from other TIMIT-derived corpora is that the entire TIMIT database was passed through
an actual telephone channel in a single call. Thus, a single type of channel distortion
and noise affect the whole database. The process was managed using a Dialogic switchboard
for the calling and receiving ends. No transducer (microphone) was employed; the original
digital signal was converted to analog using the switchboard's D/A converter, transmitted
through a telephone channel and converted back to digital format before recording.
As a result, the only distortion introduced is that of the telephone channel itself.
The STC-TIMIT 1.0 database is organized in the same manner as in the original TIMIT
corpus: 4620 files belonging to the training partition and 1680 files belonging to
the test partition. Files were recorded using 8kHz sampling frequency and muLaw encoding.
Additionally four sets of two calibration tones were generated. These were passed
through the telephone line approximately at the start of every 1/4th of the whole
database (both the source and recorded calibration tones in each set are provided).
Calibration tones are: * 2 sec. 1kHz tone * 2 sec. sweep tone from 10 Hz to 4000 Hz.
Utterances in STC-TIMIT 1.0 are time-aligned with those of TIMIT with an average precision
of 0.125 ms (1 sample), by maximizing the cross-correlation between pairs of files
from each corpus. Thus, labels from TIMIT may be used for STC-TIMIT 1.0, and the effects
of telephone channels may be studied on a frame-by-frame basis. *Data* Originally
a single wav file was created by concatenation of all files in the TIMIT database.
This file was downsampled to 8kHz and compressed using muLaw encoding. Two telephone
lines within the same building were connected to a Dialogic(R) card. One of the lines
was used as the calling-end and played the speech file, while the other line was used
as the receiving-end and recorded the new signal. The whole recording process was
conducted in a single call. Incoming speech was recorded using 8kHz sampling frequency
and muLaw encoding. After recording, the file was pre-cut according to the length
of the corresponding TIMIT database file. Each resulting file was then aligned to
its corresponding file in TIMIT using the xcorr routine in Matlab(R). Based on these
results, the recorded file was sliced again from the original recorded file using
the newly-generated alignments. Thus, each file in STC-TIMIT 1.0 is aligned to its
equivalent in TIMIT and has the same length. *Sample* For an example of the data contained
in this corpus, please listen to this audio sample.
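The alignment step described above (finding the offset that maximizes the cross-correlation, then slicing the recording to the reference length) can be sketched in plain Python. The original work used the xcorr routine in Matlab; the brute-force search below is a stand-in under simplifying assumptions: the recording only lags the reference (non-negative delay), the search is bounded by a caller-supplied `max_lag`, and the function names are hypothetical.

```python
def estimate_lag(reference, recorded, max_lag):
    """Return the delay in samples (0..max_lag) that maximizes the
    cross-correlation sum(reference[n] * recorded[n + delay])."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(
            r * recorded[i + lag]
            for i, r in enumerate(reference)
            if i + lag < len(recorded)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def slice_aligned(reference, recorded, max_lag):
    """Cut `recorded` so it starts at the estimated delay and has the reference's length."""
    lag = estimate_lag(reference, recorded, max_lag)
    return recorded[lag:lag + len(reference)]


# Toy example: the "recording" is the reference delayed by three samples.
reference = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5]
recorded = [0.0] * 3 + reference + [0.0] * 4
aligned = slice_aligned(reference, recorded, max_lag=5)
```

At an 8 kHz sampling rate one sample lasts 1/8000 s = 0.125 ms, which is the precision quoted in the summary.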
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morales, Nicolas
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634697
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 160-931-786-050-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Blog Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Blogs are posts to informal web-based journals of varying topical content. GALE Phase
1 Chinese Blog Parallel Text was prepared by the LDC and consists of 313K characters
(277 files) of Chinese blog text and its translation selected from eight sources.
This release was used as training data in Phase 1 of the DARPA-funded GALE program.
*Source Data* Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting, and data selection. Data scouting involved manually searching
the web for suitable blog text. Data scouts were assigned particular topics and genres
along with a production target in order to focus their web search. Formal annotation
guidelines and a customized annotation toolkit helped data scouts to manage the search
process and to track progress. Data scouts logged their decisions about potential
text of interest (sites, threads and posts) to a database. A nightly process queried
the annotation database and harvested all designated URLs. Whenever possible, the
entire site was downloaded, not just the individual thread or post located by the
data scout. Once the text was downloaded, its format was standardized (by running
various scripts) so that the data could be more easily integrated into downstream
annotation processes. Original-format versions of each document were also preserved.
Typically, a new script was required for each new domain name that was identified.
After scripts were run, an optional manual process corrected any remaining formatting
problems. The selected documents were then reviewed for content-suitability using
a semi-automatic process. A statistical approach was used to rank a document's relevance
to a set of already-selected documents labeled as "good." An annotator then reviewed
the list of relevance-ranked documents and selected those which were suitable for
a particular annotation task or for annotation in general. These newly-judged documents
in turn provided additional input for the generation of new ranked lists. Manual sentence
units/segments (SU) annotation was also performed on a subset of files following LDC's
Quick Rich Transcription specification. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. *Translation* After files were selected,
they were reformatted into a human-readable translation format, and the files were
then assigned to professional translators for careful translation. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the translation team,
the source data format, the translation data format, best practices for translating
certain linguistic features (such as names and speech disfluencies), and quality control
procedures applied to completed translations.
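The record does not say which statistical approach was used to rank documents against the already-selected "good" set. Purely as an illustration of the general idea, the sketch below ranks candidates by cosine similarity between bag-of-words vectors. Every name in it is hypothetical, and whitespace tokenization is an assumption that would not suit unsegmented Chinese text.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated tokens (an assumption; Chinese text needs segmentation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_relevance(candidates, good_docs):
    """Order candidates from most to least similar to the pooled 'good' documents."""
    profile = Counter()
    for doc in good_docs:
        profile.update(bag_of_words(doc))
    return sorted(
        candidates,
        key=lambda doc: cosine(bag_of_words(doc), profile),
        reverse=True,
    )


good = ["market prices rise", "stock market falls"]
ranked = rank_by_relevance(["my cat sleeps", "market prices today"], good)
```

In the workflow the summary describes, an annotator reviews the ranked list, and newly accepted documents would be added to the "good" set before the next ranking pass.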
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Blogs
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635448
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 770-467-034-042-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 3 v 3.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 3 (ATB3) v 3.2 was developed at the Linguistic Data Consortium
(LDC). It consists of 599 distinct newswire stories from the Lebanese publication
An Nahar with part-of-speech (POS), morphology, gloss and syntactic treebank annotation
in accordance with the Penn Arabic Treebank (PATB) Guidelines developed in 2008 and
2009. This release represents a significant revision of LDC's previous ATB3 publications:
Arabic Treebank: Part 3 v 1.0 LDC2004T11 and Arabic Treebank: Part 3 (full corpus)
v 2.0 (MPG + Syntactic Analysis) LDC2005T20. The ongoing PATB project supports research
in Arabic-language natural language processing and human language technology development.
The methodology and work leading to the release of this publication are described
in detail in the documentation accompanying this corpus and in two research papers,
Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines
and Consistent and Flexible Integration of Morphological Annotation in the Arabic
Treebank. *Data* ATB3 v 3.2 contains a total of 339,710 tokens before clitics are
split, and 402,291 tokens after clitics are separated for the treebank annotation.
This release includes all files that were previously made available to the DARPA GALE
program community (Arabic Treebank Part 3 - Version 3.1, LDC2008E22). A number of
inconsistencies in the 3.1 release data have been corrected here. These include changes
to certain POS tags with the resulting tree changes. As a result, additional clitics
have been separated, and some previously incorrectly split tokens have now been merged.
One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this corpus as that
text is an exact duplicate of another file in this release (ANN20020715.0018). This
reduces the number of files from 600 files in ATB3 v 2.0 to 599 files in ATB3 v 3.2.
*Sponsorship* This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does
not necessarily reflect the position or the policy of the Government, and no official
endorsement should be inferred. *Sample* The included data are available in many different
formats and files, as described in detail in the corpus documentation. A screenshot
excerpt from one of the new integrated data files is available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gaddeche, Fatma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zaghouani, Wajdi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u hin d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634700
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008L02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 853-261-507-123-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hindi WordNet
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008L02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hindi WordNet, Linguistic Data Consortium (LDC) catalog number LDC2008L02 and ISBN
1-58563-470-0, was developed by researchers at the Center for Indian Language Technology,
Computer Science and Engineering Department, IIT Bombay. Hindi, a member of the Indo-Iranian
language family, is the primary national language of India and is spoken by approximately
500 million people, making it the fifth largest language in the world. Inspired by
the well-known English language Wordnet, Hindi Wordnet is the first wordnet for an
Indian language. Wordnets are systems for analyzing the different lexical and semantic
relations between words. Specifically, a wordnet is a word sense network in which
words are grouped into semantically equivalent units called synsets. Each synset represents
a lexical concept, and synsets are linked to each other by semantic relations (between
synsets) and lexical relations (between words). Similar in design to the Princeton
Wordnet for English, Hindi Wordnet incorporates additional features to capture the
complexities of Hindi. This release of Hindi Wordnet consists of 56,928 unique words
and 26,208 synsets. Additional information about the development of Hindi Wordnet
is available at the Hindi WordNet web site. *Data* Hindi WordNet contains nouns, verbs,
adjectives and adverbs. Each entry consists of the following elements:
* Synset: a set of synonymous words. For example, "विद्यालय, पाठशाला, स्कूल" (vidyaalay,
paaThshaalaa, skuul) represents the concept of school as an educational institution.
The words in the synset are arranged according to the frequency of usage.
* Gloss: describes the concept. It consists of two parts:
- Text definition: It explains the concept denoted by the synset. For example,
"वह स्थान जहाँ प्राथमिक या माध्यमिक स्तर की औपचारिक शिक्षा दी जाती है" (vah sthaan jahaaM
praathamik yaa maadhyamik star kii aupacaarik shikshaa dii jaatii hai) explains the
concept of school as an educational institution.
- Example sentence: It gives the usage of the words in a sentence. Generally, the
words in a synset are replaceable in the sentence. For example,
"इस विद्यालय में पहली से पाँचवीं तक की शिक्षा दी जाती है" (is vidyaalay me pahalii se pancvii
tak kii shikshaa dii jaatii hai) gives the usage for the words in the synset representing
school as an educational institution.
* Position in Ontology: An ontology is a hierarchical organization of concepts, or
more specifically, a categorization of entities and actions. A separate ontological
hierarchy exists for each syntactic category (noun, verb, adjective, adverb). Each
synset is mapped into some place in the ontology.
This release of Hindi WordNet is made available as a complete Java application along
with an API to facilitate further development.
LANGUAGE NOTE
- Language note:
Content in Hindi. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Hindi language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Hindi language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bhattacharyya, Pushpak
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pande, Prabhakar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lupu, Laxmi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008L02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634514
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 794-819-316-121-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Proposition Bank 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Proposition Bank 2.0 is a continuation of the Chinese Proposition Bank project,
which aims to create a corpus of Chinese text annotated with information about basic
semantic propositions. Chinese Proposition Bank 1.0 consists of predicate-argument
annotation on 250,000 words from Chinese Treebank 5.0. Chinese Proposition Bank 2.0
adds predicate-argument annotation on 500,000 words from Chinese Treebank 6.0. The
data sources include newswire from Xinhua News Agency, articles from Sinorama Magazine,
news from the website of the Hong Kong Special Administrative Region and transcripts
from various Chinese broadcast news programs. *Data* This release contains the predicate-argument
annotation of 81,009 verb instances (11,171 unique verbs) and 14,525 noun instances
(1,421 unique nouns). The annotation of nouns is limited to nominalizations that have
a corresponding verb. The general annotation guidelines and the lexical guidelines
(called frame files) for each verbal and nominal predicate are included in this release.
Total propositions for verbs: 81,009
Total propositions for nouns: 14,525
Total verbs framed: 11,171
Total framesets: 11,776
Verbs with multiple framesets: 474
Average framesets per verb: 1.05
Total nouns framed: 1,421
Total noun framesets: 1,528
Nouns with multiple framesets: 48
Average framesets per noun: 1.08
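The two averages quoted above appear to be the frameset totals divided by the number of framed predicates. The following is a minimal sketch of that arithmetic, assuming "Total framesets" refers to verb framesets; the function and variable names are illustrative and are not part of the corpus.

```python
# Counts copied from the summary above.
verb_framesets, verbs_framed = 11_776, 11_171
noun_framesets, nouns_framed = 1_528, 1_421


def average_framesets(framesets: int, predicates: int) -> float:
    """Mean number of framesets per framed predicate."""
    return framesets / predicates


verb_avg = average_framesets(verb_framesets, verbs_framed)
noun_avg = average_framesets(noun_framesets, nouns_framed)

# Rounded to two decimals, these match the figures reported in the record.
print(f"verbs: {verb_avg:.2f}  nouns: {noun_avg:.2f}")
```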
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u por d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634719
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 563-271-238-124-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
West Point Brazilian Portuguese Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
West Point Brazilian Portuguese Speech is a database of digital recordings of spoken
Brazilian Portuguese designed and collected by staff and faculty of the Department
of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL)
to develop acoustic models for speech recognition systems. The U.S. government uses
such systems to provide speech-recognition enhanced language learning courseware to
government linguists and students enrolled in various government language programs.
The data in this corpus was collected in March 1999 in Brasilia, Brazil using informants
from a Brazilian military academy. The corpus consists of read speech from 60 female
and 68 male native and non-native speakers. The speech was elicited from a prompt
script containing 296 sentences and phrases typically used in language learning situations.
The prompts are listed in the file prompts.txt. Each line of this file has two fields
separated by a tab: the first field denotes the base name of the waveform file; and
the second field denotes the prompt used to record the utterance. A pronouncing dictionary
developed by Dr. Sheila Ackerlind with help from cadet Sterling Packer is provided
in the file SANTIAGO.txt. The speech was collected using four laptop computers running
MS Windows. Three of the computers recorded with a 16-bit data size and a sampling rate
of 22050 Hz; the other laptop recorded with an 8-bit data size at a sampling rate
of 11025 Hz. The recording script presented a visual display of the sentence to be
recorded. The informant pressed a key and spoke the sentence. The recording was played
back for review, allowing the utterance to be re-recorded. A member of the data collection
team was present during the recording session to verify recordings and to provide
technical assistance in case of malfunctioning equipment.
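The two-field layout of prompts.txt described above (waveform base name, a tab, then the prompt) can be read with a few lines of Python. This is a sketch under stated assumptions: the sample base names and prompts below are hypothetical, and the character encoding of the real file is not stated in this record, so it should be checked before opening the file.

```python
import io


def read_prompts(lines):
    """Map each waveform base name to its prompt text.

    Each input line is expected to hold two tab-separated fields:
    the base name of the waveform file, then the prompt.
    """
    prompts = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, in case a prompt contains a tab.
        base_name, _, prompt = line.partition("\t")
        prompts[base_name] = prompt
    return prompts


# Hypothetical sample; real entries come from prompts.txt in the corpus.
sample = io.StringIO("s0001\tBom dia.\ns0002\tComo vai?\n")
prompts = read_prompts(sample)
```

With the real corpus, the same function could be given an open file handle for prompts.txt in place of the in-memory sample.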
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Portuguese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Portuguese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Portuguese
- Geographic subdivision:
Brazil
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morgan, John
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ackerlind, Sheila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Packer, Sterling
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634751
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 374-785-718-527-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 contains transcripts and
English translations of 21.9 hours of Chinese broadcast news programming from China
Central TV (CCTV) and Phoenix TV. It does not contain the audio files from which the
transcripts and translations were generated. GALE Phase 1 Chinese Broadcast News Parallel
Text - Part 2 is the second of the three-part GALE Phase 1 Chinese Broadcast News
Parallel Text, which, along with other corpora, was used as training data in year
1 (Phase 1) of the DARPA-funded GALE program. GALE Phase 1 Chinese Broadcast News
Parallel Text - Part 1 was published in 2007. *Source Data* A total of 21.9 hours
of Chinese broadcast news recordings were selected from two sources, CCTV (a broadcaster
from Mainland China) and Phoenix TV (a Hong Kong based satellite TV station). The
transcripts and translations represent recordings of four different programs. A manual
selection procedure was used to choose data appropriate for the GALE program, namely,
news programs focusing on current events. Stories on topics such as sports, entertainment
and business were excluded from the data set. The following table is a summary of
the files included in this release.
Source      Program             Epoch (YYYY.MM)    #hours  #characters
CCTV4       CCTV4 Daily News    2004.11 - 2005.11  10.2    150,947
CCTV4       CCTV4 News3         2005.04 - 2005.10  2.9     41,356
Phoenix TV  Global Report       2005.05 - 2005.10  5.7     73,717
Phoenix TV  Good Morning China  2005.10 - 2005.11  3.1     46,491
*Transcription* The selected audio
snippets were carefully transcribed by LDC annotators and professional transcription
agencies following LDC's Quick Rich Transcription specification. Manual sentence unit/segment
(SU) annotation was also performed as part of the transcription task. Three types
of end of sentence SU are identified:
- statement SU
- question SU
- incomplete SU
*Translation* After transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional translators for careful
translation. Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as names and speech
disfluencies) and quality control procedures for completed translations. *Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not
necessarily reflect the position or the policy of the Government, and no official
endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634794
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 466-566-464-744-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 is the second of the three-part
GALE Phase 1 Arabic Broadcast News Parallel Text, which, along with other corpora,
was used as training data in year 1 (Phase 1) of the DARPA-funded GALE program. GALE
Phase 1 Arabic Broadcast News Parallel Text - Part 1 was released in 2007. GALE Phase
1 Arabic Broadcast News Parallel Text - Part 2 contains transcripts and English translations
of 10.7 hours of Arabic broadcast news programming selected from various sources.
This corpus does not contain the audio files from which the transcripts and translations
were generated. LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text
data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Source Data* A total of 10.7 hours of Arabic broadcast news recordings were selected
from four sources and four different programs. A manual selection procedure was used
to choose data appropriate for the GALE program, namely, news and conversation programs
focusing on current events. Stories on topics such as sports, entertainment news,
and stock market reports were excluded from the data set. The following table is a
summary of the files included in this release.
Source            Program        Epoch (YYYY.MM)    #hours  #words
Dubai TV          Dubai News     2005.02 - 2005.12  2.0     11,078
Nile TV           News           2000.11 - 2000.12  0.5     3,079
Radio Sawa        News at 06:00  2005.11            2.8     9,712
Voice of America  News           2000.11 - 2001.03  5.4     32,305
*Transcription* The selected audio files were carefully transcribed
by LDC annotators and professional transcription agencies following LDC's Quick Rich
Transcription specification. Manual sentence unit/segment (SU) annotation was also
performed as part of the transcription task. Three types of end of sentence SU are
identified:
- statement SU
- question SU
- incomplete SU
*Translation* After transcription
and SU annotation, files were reformatted into a human-readable translation format
and were assigned to professional translators for careful translation. Translators
followed LDC's GALE Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best practices for translating
certain linguistic features (such as names and speech disfluencies), and quality control
procedures applied to completed translations. *Final Data* TDF Format: All final data
are in Tab Delimited Format (TDF). TDF is compatible with other transcription formats,
such as Transcriber format and AG format, and it is easy to process. Each line of
a TDF file corresponds to a speech segment and contains 13 tab delimited fields (the
13th field "suType" might be empty):
field               data_type
-----               ---------
1  file             unicode
2  channel          int
3  start            float
4  end              float
5  speaker          unicode
6  speakerType      unicode
7  speakerDialect   unicode
8  transcript       unicode
9  section          int
10 turn             int
11 segment          int
12 sectionType      unicode
13 suType           unicode
A source TDF file and its translation are the same except that the transcript in the
source TDF is replaced by its English translation. Encoding: All data are encoded in
UTF-8. *Sponsorship* This work was supported in part
by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
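The 13 tab-delimited TDF fields described above map naturally onto a typed record. The following is a minimal sketch, assuming one segment per line with the fields in the listed order; the function name and the sample values are hypothetical, and any header or comment lines that real TDF files may contain are not handled here.

```python
# Field names and types as listed in the summary; "unicode" is read as str.
TDF_FIELDS = [
    ("file", str), ("channel", int), ("start", float), ("end", float),
    ("speaker", str), ("speakerType", str), ("speakerDialect", str),
    ("transcript", str), ("section", int), ("turn", int),
    ("segment", int), ("sectionType", str), ("suType", str),
]


def parse_tdf_line(line: str) -> dict:
    """Parse one TDF segment line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) == 12:
        values.append("")  # the 13th field (suType) might be empty or absent
    if len(values) != len(TDF_FIELDS):
        raise ValueError(f"expected 13 fields, got {len(values)}")
    return {name: cast(value)
            for (name, cast), value in zip(TDF_FIELDS, values)}


# Hypothetical segment line; real values come from the corpus files.
sample_line = "\t".join([
    "example_file", "0", "12.5", "15.25", "speaker1", "male", "native",
    "hello world", "1", "2", "3", "report", "statement",
]) + "\n"
segment = parse_tdf_line(sample_line)
```

Since the data are encoded in UTF-8, a real file would be opened with `encoding="utf-8"` before its lines are passed to the parser.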
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u tam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634778
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 747-471-848-124-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2005 NIST Language Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for 2005 NIST Language Recognition Evaluation, Linguistic
Data Consortium (LDC) catalog number LDC2008S05 and ISBN 1-58563-477-8. The goal of
the NIST (National Institute of Standards and Technology) Language Recognition Evaluation
(LRE) is to establish the baseline of current performance capability for language
recognition of conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted two previous evaluations in 1996 and
2003. For the 2005 LRE, the emphasis was on research directed toward a general base
of technology to be ported to various language recognition tasks with minimum effort
and the development of the ability to make more difficult discriminations between
similar languages and dialects of the same language. That focus augmented the traditional
evaluation goals, those being:
* to drive the technology forward
* to measure the state-of-the-art
* to find the most promising algorithmic approaches
The task evaluated was the detection of a given target language or dialect. From a test
segment of speech and a target language or dialect, the system to be evaluated determined
whether the speech was from the target language or dialect. The evaluation consisted of
speech from the following languages and dialects:
* English (American)
* English (Indian)
* Hindi
* Japanese
* Korean
* Mandarin (Mainland)
* Mandarin (Taiwan)
* Spanish (Mexican)
* Tamil
The 2005 NIST Language Recognition Evaluation Plan, which includes a description
of the evaluation tasks, is included with this release. Further information regarding
this evaluation is also available at the NIST Language Recognition Evaluation website.
*Data* Each speech file is one side of a "4-wire" telephone conversation represented
as 8-bit 8 kHz mu-law data. There are 11,106 speech files in SPHERE (.sph) format for
a total of 73.2 hours of speech. The speech data was compiled from LDC's CALLFRIEND
corpora and from data collected by Oregon Health and Science University, Beaverton,
Oregon. Each test segment was prepared using an automatic speech activity detection
algorithm to identify areas and durations of speech. The test segments were stored
in SPHERE file format, one segment per file. Unlike previous evaluations, areas of
silence were not removed from the segments. Segments were chosen to contain a specified
approximate duration of actual speech. Auxiliary information was included in the SPHERE
headers to document the source file, start time, and duration of all excerpts that
were used to construct the segment. The test segments contain three nominal durations
of speech: 3 seconds, 10 seconds, and 30 seconds. Actual speech durations vary, but
were constrained to be within the ranges of 2-4 seconds, 7-13 seconds, and 25-35 seconds,
respectively. Note that this refers to duration of actual speech contained in segments
as determined by the speech activity detection algorithm; signal durations in general
are longer due to areas of silence in the segments. Shorter speech duration test segments
are subsets of longer speech duration test segments; i.e., each 10-second test segment
is a subset of a corresponding 30-second test segment, and each 3-second test segment
is a subset of a corresponding 10-second segment. Performance was evaluated separately
for test segments of each duration. NIST recommends using data from the 1996 and 2003
evaluations as development data. This data may be found in 2003 NIST Language Recognition
Evaluation, LDC2006S31. Because the 1996 and 2003 evaluations did not cover Indian-accented
English, this release includes a development data set of Indian-accented English.
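The mapping from measured speech duration to the three nominal test conditions described in this summary can be sketched as follows. This is an illustrative helper only, assuming the ranges are inclusive; the function and constant names are hypothetical and are not part of the NIST evaluation tools.

```python
from typing import Optional

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# as stated in the summary: 2-4, 7-13 and 25-35 seconds.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (25.0, 35.0)}


def nominal_duration(speech_seconds: float) -> Optional[int]:
    """Return the nominal condition (3, 10 or 30) for a speech duration.

    Returns None when the duration falls outside every range. Note that
    this refers to detected speech, not to the (longer) signal duration.
    """
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

Because performance was evaluated separately for each duration, a helper of this kind could be used to bucket segments before scoring.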
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Tamil, Korean, Japanese, Hindi, English, Spanish, and Mandarin Chinese.
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hadfield, Hannah
ADDED ENTRY--PERSONAL NAME
- Personal name:
de Villiers, Jacques
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hosom, John-Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
van Santen, Jan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634786
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 569-415-930-320-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Alphadigit Version 1.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for CSLU: Alphadigit Version 1.3, Linguistic Data
Consortium (LDC) catalog number LDC2008S06 and ISBN 1-58563-478-6. Alphadigit Version
1.3 is a collection of 78,044 utterances from 3,025 speakers saying six-digit strings
of letters and digits over the telephone for a total of approximately 82 hours of
speech. Each speech file has corresponding orthographic and phonemic transcriptions.
This corpus was created by the Center for Spoken Language Understanding (CSLU), Oregon
Health & Science University, Beaverton, Oregon. *Data* Speakers were recruited using
Usenet postings. Respondents registered for the collection by completing an online
form. Once registered, they received a list of 18-29 six-digit strings (e.g., "a 2
b 4 5 g") and participation instructions. Speakers called the CSLU data collection
system by dialing a toll-free number and were prompted for each string; 1102 different
strings were used throughout the course of the data collection. The lists were set
up to balance for phonetic context between all letter and digit pairs. The data were
recorded directly from a digital phone line without digital-to-analog or analog-to-digital
conversion at the recording end using the CSLU T1 digital data collection system.
The sampling rate was 8 kHz and the files were stored in 8-bit mu-law format on a UNIX
file system. The files have been converted to RIFF standard file format, 16-bit linearly
encoded. *Transcription* All of the files included in this corpus have corresponding
non-time-aligned word-level transcriptions and time-aligned phoneme-level transcriptions
(automatic forced alignment) that comply with the conventions in the CSLU Labeling
Guide. Non-time-aligned orthographic transcriptions provide quick access to the content
of an utterance; they may contain markers for word boundaries to support access and
retrieval at the lexical level. Phonetic/phonemic transcriptions represent the phonetic
content of an utterance at a given level of detail that is made explicit by the use
of diacritics. Phonetic phenomena transcribed include excessive nasalization, glottalization,
frication on a stop, centralization, lateralization, rounding and palatalization.
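The conversion from 8-bit mu-law to 16-bit linear samples mentioned in this summary can be illustrated with the standard G.711 expansion. This is a minimal sketch of that well-known algorithm, not the conversion tool actually used by CSLU; the function names are hypothetical.

```python
from typing import List

BIAS = 0x84  # bias added during G.711 mu-law compression


def mulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed linear sample."""
    code = ~code & 0xFF              # mu-law bytes are stored bit-inverted
    sign = code & 0x80               # top bit: set for negative samples
    exponent = (code >> 4) & 0x07    # 3-bit segment number
    mantissa = code & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data: bytes) -> List[int]:
    """Expand a buffer of mu-law bytes to a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]
```

Every decoded value fits in a signed 16-bit integer, which is why 16-bit linear encoding is a natural target for data recorded as 8-bit mu-law.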
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Durham, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u yor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635006
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008L03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 973-344-578-516-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
yor
- Language code of text/sound track or separate title:
cpe
- Language code of text/sound track or separate title:
yor
- Language code of text/sound track or separate title:
cpe
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yor
- Language code of text/sound track or separate title:
trf
- Language code of text/sound track or separate title:
luq
- Language code of text/sound track or separate title:
gul
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Global Yoruba Lexical Database v. 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008L03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Global Yoruba Lexical Database v. 1.0 is a set of related dictionaries providing
definitions and translations for over 450,000 words from the Yoruba language and its
variants: Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí
(over 8,000 words) and Trinidadian (over 1,000 words). Yoruba is a Niger-Congo language
(sub classification: Kwa > Yoruboid) spoken natively by nearly 20 million people,
the vast majority of them in southwestern Nigeria. There are also approximately a
half million Yoruba speakers in Benin, as well as speakers in Togo and Ghana and among
the emigrant populations in the United States and the United Kingdom. In addition,
roughly two million people in Nigeria speak Yoruba as a second language. The Yoruba
language diaspora is wide, stretching from southwestern Nigeria and Benin westward
to the Caribbean and islands along the southeastern United States coast. Yoruba and
other African dialects arrived in the Americas and the Caribbean as a consequence
of the Atlantic slave trade. Throughout the region, Yoruba dialects blended with each
other and with languages like Spanish and French to form a variety of creoles such
as Gullah in the United States and Nagô in Brazil. Many of those creoles have become
the language of liturgy and music in Cuba, Brazil, Argentina, Trinidad, Jamaica and
parts of the United States and Canada. The ultimate goal of this dictionary is to
provide coverage for all Yoruba dialects across the globe. For that reason, it will
continue to be a work in progress. The current standard orthography is tone-driven.
Yoruba has three tones: a high tone, a middle tone and a low tone. Each syllable in
a Yoruban word must have at least one tone and long vowels may have two tones. While
there are no explicit rising or falling tones, combinations of the language's three
basic tones may produce the same effect. Grammatically, Yoruba is a Subject-Verb-Object
(SVO) language. Verbs have no infinitive forms, past or present tense and typically
have only a single syllable. Discrete auxiliary words provide information on the verb
tense. Nor do Yoruba nouns have plural or singular forms; their number derives from
the context in which the word occurs. The Yoruba dialect continuum consists of over
fifteen varieties, with considerable phonological and lexical differences among them
and some grammatical ones as well. Peripheral areas of dialectal regions often have
some similarities to adjoining dialects. Standard Yoruba is a koine used for education,
writing, broadcasting, and contact between speakers of different dialects. It is also
called Literary Yoruba, common Yoruba, or simply Yoruba without qualification. Though
in large part based on the Ọ̀yọ̀ and Ibadan dialects, it incorporates several features
from other dialects and has a simplified vowel harmony system and some other features
not found in other Yoruba dialects. *Data* This release encompasses the following
languages and dialects, listed with the number of words in each dictionary:
* Yoruba->English (142,389 words): This dictionary of Standard Yoruba contains detailed
lexicographic entries which include the part of speech, the English definition of the
Yoruba headword, cross references, examples in English and the morphemic decomposition
of the Yoruba headword.
* English->Yoruba (226,585 words): This dictionary maps the English headword back to
Standard Yoruba and includes the part of speech, Yoruba definition, and morphemic
decomposition of the Yoruba word.
* Gullah->English and Yoruba (3,636 words): Gullah is a creole spoken in the coastal
Low Country of South Carolina and Georgia in the United States. Although the language
is no longer spoken to a great extent, its words are still commonly used for personal
names and nicknames. The dictionary translates from Gullah headwords to English and
to Standard Yoruba.
* Lucumí->Spanish, English and Yoruba (8,075 words): Lucumí is the ritual language of
the Santeria religion practiced in Cuba. The Lucumí dictionary translates from a Lucumí
headword to Cuban Spanish to English to Standard Yoruba. At the time of this publication
in 2008, some entries do not have complete translations and only map from Lucumí to
Cuban Spanish.
* Trinidadian->English and Yoruba (1,187 words): Trinidadian is a creole which blends
English, French, Spanish and African languages. The Trinidadian dictionary presents
those words that have Yoruban roots and maps from the Trinidadian headword to English
and Standard Yoruba.
The dictionaries in this publication
are presented in two formats, Toolbox databases and XML. Short for The Field Linguist's
Toolbox, Toolbox is a lexicographical database system published by SIL. SIL makes
Toolbox freely available for download. In order to use the Global Yoruba Lexical Database
v. 1.0, Toolbox must first be installed on the user's local computer. The orthography
of the text in the databases conforms to that presented to students in the Nigerian
school system. The basic Yoruba alphabet is: a b d e ẹ f g gb h i j k l m n o ọ
p r s ṣ t u w y. The letter gb is a digraph, two letters that combine to form a single
phoneme. In written Yoruba, gb functions as a single letter. In the Toolbox presentation,
this has been taken into account and the software sorts the words accordingly in all
functions. The XML presentation has been sorted according to the above alphabet but
is a static, flat file. For that reason, developers creating applications from the
XML files will need to take into account the digraph when writing searching and reporting
functions. As Yoruba is a tonal language, the written language uses additional diacritic
marks to denote tones. The orthography uses three tones:
* Low: denoted with a grave (`) symbol as in à
* Mid: plain letter without diacritics
* High: denoted with an acute (´) symbol as in á
Both the Toolbox and XML presentations encode the text in Unicode
UTF-8 using normalized form C. Unicode normalized forms govern the order in which
letters and characters are composed and processed by software systems. Normalized
form C is the standard form used by most web systems and is a W3C standard for the
web. The Toolbox presentation uses the Arial Unicode MS font for display. The Tahoma
and Lucida Grande fonts will also display the Yoruba alphabet under UTF-8 encoding.
Since XML only provides information about document structure, fonts are not specified
in the XML versions of the dictionaries. Displaying non-Western letters: Windows users
will need to install and configure their computers for Extended Language support.
To do this, open the Windows Control Panel and click the Regional and Language Options
icon. In the Regional and Language Options window that opens, select the Languages
pane. Under the Supplemental Language Support section, check both check boxes and
click OK. Windows will ask for your install disc and will install the modules needed
to properly display complex and non-Western letters. If users do not have their Windows
install disc, they should contact their local system administrator to install Extended
Language Support.
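The summary notes that applications built on the XML files must handle the gb digraph themselves when searching and sorting. The following is a minimal sketch of a digraph-aware sort key, assuming the alphabet order given in the summary and assuming that tone marks (grave and acute) are ignored for ordering; the names are hypothetical and this is not the collation used by Toolbox.

```python
import unicodedata

# Alphabet order from the summary; "gb" counts as one letter.
# \u1eb9, \u1ecd and \u1e63 are e, o and s with dot below (NFC forms).
YORUBA_ALPHABET = [
    "a", "b", "d", "e", "\u1eb9", "f", "g", "gb", "h", "i", "j", "k", "l",
    "m", "n", "o", "\u1ecd", "p", "r", "s", "\u1e63", "t", "u", "w", "y",
]
RANK = {letter: index for index, letter in enumerate(YORUBA_ALPHABET)}
TONE_MARKS = {"\u0300", "\u0301"}  # combining grave and combining acute


def strip_tones(word: str) -> str:
    """Lower-case a word and remove tone marks, keeping the dot below."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def yoruba_sort_key(word: str) -> list:
    """Return a list of letter ranks, treating "gb" as a single letter."""
    base = strip_tones(word)
    key, i = [], 0
    while i < len(base):
        if base.startswith("gb", i):
            key.append(RANK["gb"])
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(base[i], len(YORUBA_ALPHABET)))
            i += 1
    return key
```

With this key, `sorted(words, key=yoruba_sort_key)` places "gu" before "gbogbo", whereas a plain code point sort would put "gbogbo" first. Words that differ only in tone compare equal here, so a real application would probably need a secondary key.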
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yoruba, Trinidadian Creole English, Lucumi, Sea Island Creole English,
English, and Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Awoyale, Yiwola
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008L03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634980
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 379-986-207-358-6
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The PennBioIE CYP Corpus consists of 1100 PubMed abstracts on the inhibition of cytochrome
P450 enzymes, comprising approximately 274,000 words of biomedical text, tokenized
and annotated for paragraph, sentence, part of speech, and five types of biomedical
named entities in three categories of interest. 324 of the abstracts have also been
syntactically annotated. All of the annotation was based on Penn Treebank II standards,
with some modifications for special characteristics of the biomedical text. The entity
definitions were developed and revised in an extensive process of interaction between
domain experts and biomedically trained annotators. The data was prepared by the Linguistic
Data Consortium for the Institute for Research in Cognitive Science, with funding
from the National Science Foundation under Grant No. ITR EIA-0205448, Information
Technology Research (ITR) program, in collaboration with GlaxoSmithKline Pharmaceuticals
R&D. *Data Description* The corpus contains 1100 PubMed abstracts comprising approximately
313,000 total words of text. Each file has been tokenized and its biomedical portions
(274,000 words) exhaustively annotated for paragraph, sentence, and part of speech,
and non-exhaustively annotated for 5 types of named entity. Each token has a part-of-speech
tag. Tokens and POS tags: Tokens in biomedical and chemical notation and terms, and
spelled-out numbers, may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+
+ K+)ATPase", "two hundred seven"); and named entity mentions may comprise several
tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span
sentence boundaries. Biomedical and non-biomedical text: The title and body of each
abstract are considered to be biomedical text, and the automatic and manual annotations
in them have been extensively curated. Everything else, such as citation information
and author names, is considered non-biomedical; this has not been entity annotated,
and its automated tokenization and part of speech tags have not been curated and are
known to be unreliable. In non-biomedical text, the tag "section" is used instead
of "sentence", allowing users to include or exclude these parts. There are approximately
274,000 words of biomedical text and 39,000 words of non-biomedical text. *Principles
and Methods* Many annotation projects start with an already annotated corpus, such
as the Penn Treebank or the Brown Corpus, which is treated as unchangeable. As a result,
annotation practices have sometimes involved compromises which might not have been
necessary if the earlier annotation had been able to integrate the requirements of
the later work. Such integration is necessary here because of the scope of this project,
involving highly technical biomedical texts, entity definitions driven by the needs
of biomedical research, and the goal of making the annotation layers work together
as much as possible, e.g., using entity information in the treebank annotation of
prenominal modifiers. Such integration is also possible given the relatively long
term of the grant (five years) and because researchers were starting with fresh text,
applying all layers of annotation themselves. The texts are annotated at the following
layers:
* Paragraph
* Sentence
* Biomedical entity
* Token and part of speech
* Syntax (treebanking) (some texts only)
* Semantic relations
Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied
by automatic taggers and manually corrected; entity annotation is manual. The authors
originally used a POS tagger
trained on Penn Treebank data, which made many errors on the very different text of
these biomedical abstracts. When there was enough manually-corrected data to train
a tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al. 2004 (slides)).
Annotation at all layers except entity is based on the Penn Treebank II guidelines,
with a number of modifications that have been found necessary, many of which were
subsequently adopted by the Penn Treebank. Entity definitions came originally from
domain experts and were developed and refined in dialogue with the annotators. All
annotation is standoff: the source text is never modified, annotations being made
in a separate file.
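The standoff principle described in this summary, where annotations live apart from an unmodified source text, can be sketched as character offsets into that text. This is only an illustration under stated assumptions: the class, the layer and label names, and the sample sentence are hypothetical and do not reflect the corpus's actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One standoff annotation: a labelled character span in the source."""
    start: int   # offset of the first character
    end: int     # offset one past the last character
    layer: str   # e.g. "sentence", "token", "entity" (hypothetical names)
    label: str


def covered_text(source: str, ann: Annotation) -> str:
    """Recover the annotated text by slicing the untouched source."""
    return source[ann.start:ann.end]


# Made-up example sentence and annotations, for illustration only.
SOURCE = "Ketoconazole inhibits CYP3A4."
ANNOTATIONS = [
    Annotation(0, 29, "sentence", "sentence"),
    Annotation(0, 12, "token", "NN"),
    Annotation(0, 12, "entity", "substance"),
    Annotation(22, 28, "entity", "enzyme"),
]
```

Because several layers can point at the same span without touching the text, tokens, entities and sentences can be revised independently of one another.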
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Medicine
- Form subdivision:
Terminology
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Medicine
- Form subdivision:
Databases.
- General subdivision:
Abstracting and indexing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Medicine
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mandel, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
GlaxoSmithKline Pharmaceuticals R&D
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634816
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 377-937-743-934-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BLLIP North American News Text, Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Brown Laboratory for Linguistic Information Processing (BLLIP) North American News
Text, Complete, LDC2008T13, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing
of approximately 24 million sentences from the North American News Text Corpus (LDC95T21).
The North American News Text Corpus consists of English news text from the Los Angeles
Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service
(1994-1996) and the Wall Street Journal (1994-1996). BLLIP North American News Text
is released in two versions: BLLIP North American News Text, Complete (LDC2008T13),
a members-only corpus that contains sentences from all sources in The North American
News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14),
a corpus available to nonmembers that does not include the Wall Street Journal data
from The North American News Text Corpus. To complement the Complete and General Release
versions of BLLIP North American News Text, LDC is re-releasing The North American
News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the
members-only original version, is now available as a 2008 Membership Year corpus.
North American News Text, General Release (LDC2008T16) (which does not include news
text from the Wall Street Journal), is available to nonmembers for the first time.
The directory structure of each of these publications has been restructured to be
identical to the directory structure of the BLLIP releases. *Methodology* A key problem
in natural language processing is syntactic ambiguity resulting from uncertain relationships
between words and their connections to sentence clauses. Sentences that can be constructed
with correct syntax in more than one way are ambiguous, and such sentences generate
multiple parse trees when they are separated into clauses by parts of speech. Traditional
parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90%
accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences
requires a probabilistic approach. Using the relative frequencies of grammar rules,
statistical processing techniques assign probabilities for each clause. These probabilities
are then summed up over each complete sentence parse and a probability is assigned
for that sentence parse. In that way, the most likely parse can be determined. The
data in this release was parsed into Penn Treebank-style parse trees using a re-ranking
parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser
is statistically-based and uses a generative first stage followed by a discriminative
second stage. Both stages were trained on the Wall Street Journal data in Treebank-2
(LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43)
contains a complete Treebank-style parsing of that Wall Street Journal material. In
order to produce BLLIP North American News Text, the Charniak-Johnson parser used
a simplified context-free grammar in the first stage to generate a set of n-best parses.
Those parses were then pruned by eliminating the parses at the edges of the distribution.
In the second stage, a maximum entropy-based parser using a complete grammar was applied.
The output trees are ranked in order of probability. *Data* The parses in BLLIP North
American News Text include constituency and POS tagging information for each of the
50-best parses of each sentence. Each file contains a sequence of n-best lists. An
n-best list is a list of the top n parses of each sentence with the corresponding
parser probability and re-ranker score. Following is an example of a simple n-best
list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next
token is the article id from the North American News Text Corpus ("reute9406_007.0356"),
followed by an underscore, followed by the number of the sentence in the article ("13").
The parses follow; for brevity, only four parses out of the fifty are presented here.
Each parse consists of a reranker score (4.9244 for the first parse) and parser log
probability (-147.337 for the first parse), a new line, and then the parse tree itself.
Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by
decreasing reranker scores. Source material is as follows:

Source                                                 Dates      Approx. # Words (millions)
Los Angeles Times & Washington Post                    1994-1997  52
New York Times                                         1994-1996  173
Reuters (General and Financial)                        1994-1996  85
Wall Street Journal (Not included in General Release)  1994-1996  40
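The n-best list layout described in this summary can be read with a short script. The following is a minimal Python sketch, not part of the LDC release. It assumes each list starts with a header line (parse count and sentence id), followed by alternating score lines and tree lines, and that blank lines only separate lists. The names `Parse`, `NBestList`, and `read_nbest_lists`, and the shortened two-parse sample, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Parse:
    reranker_score: float  # first number on the score line
    log_prob: float        # parser log probability, second number
    tree: str              # Penn Treebank-style bracketing, kept as text


@dataclass
class NBestList:
    n_parses: int
    article_id: str
    sentence_no: int
    parses: List[Parse]


def read_nbest_lists(lines: Iterable[str]) -> Iterator[NBestList]:
    """Yield one NBestList per header line (assumed line-per-field layout)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    i = 0
    while i < len(rows):
        count, sent_id = rows[i].split()
        # The sentence number follows the last underscore of the id.
        article_id, _, sent_no = sent_id.rpartition("_")
        i += 1
        parses = []
        for _ in range(int(count)):
            score, log_prob = rows[i].split()
            parses.append(Parse(float(score), float(log_prob), rows[i + 1]))
            i += 2
        yield NBestList(int(count), article_id, int(sent_no), parses)


# Hypothetical, shortened sample: real lists hold up to 50 full trees.
SAMPLE = """\
2 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued)) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued))) (. .))
"""

NBEST = list(read_nbest_lists(SAMPLE.splitlines()))
```

Because the lists are sorted by decreasing reranker score, `NBEST[0].parses[0]` is the reranker's preferred parse under these assumptions.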
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
McClosky, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charniak, Eugene
ADDED ENTRY--PERSONAL NAME
- Personal name:
Johnson, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634824
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 743-612-696-894-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BLLIP North American News Text, General Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Brown Laboratory for Linguistic Information Processing (BLLIP) North American News
Text, General Release, LDC2008T14, isbn 1-58563-482-4, contains a Penn Treebank-style
parsing of approximately 21 million sentences from the North American News Text Corpus
(LDC95T21). The North American News Text Corpus consists of English news text from
the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996),
Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996). BLLIP North
American News Text is released in two versions: BLLIP North American News Text, Complete
(LDC2008T13), a members-only corpus that contains sentences from all sources in The
North American News Text Corpus; and BLLIP North American News Text, General Release
(LDC2008T14), a corpus available to nonmembers that does not include the Wall Street
Journal data from The North American News Text Corpus. To complement the Complete
and General Release versions of BLLIP North American News Text, LDC is re-releasing
The North American News Text Corpus in two versions. North American News Text, Complete
(LDC2008T15), the members-only original version, is now available as a 2008 Membership
Year corpus. North American News Text, General Release (LDC2008T16) (which does not
include news text from the Wall Street Journal), is available to nonmembers for the
first time. The directory structure of each of these publications has been restructured
to be identical to the directory structure of the BLLIP releases. *Methodology* A
key problem in natural language processing is syntactic ambiguity resulting from uncertain
relationships between words and their connections to sentence clauses. Sentences that
can be constructed with correct syntax in more than one way are ambiguous, and such
sentences generate multiple parse trees when they are separated into clauses by parts
of speech. Traditional parsing techniques, such as part-of-speech (POS) tagging, typically
achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous
sentences requires a probabilistic approach. Using the relative frequencies of grammar
rules, statistical processing techniques assign probabilities for each clause. These
probabilities are then summed up over each complete sentence parse and a probability
is assigned for that sentence parse. In that way, the most likely parse can be determined.
The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking
parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser
is statistically-based and uses a generative first stage followed by a discriminative
second stage. Both stages were trained on the Wall Street Journal data in Treebank-2
(LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43)
contains a complete Treebank-style parsing of that Wall Street Journal material. In
order to produce BLLIP North American News Text, the Charniak-Johnson parser used
a simplified context-free grammar in the first stage to generate a set of n-best parses.
Those parses were then pruned by eliminating the parses at the edges of the distribution.
In the second stage, a maximum entropy-based parser using a complete grammar was applied.
The output trees are ranked in order of probability. *Data* The parses in BLLIP North
American News Text include constituency and POS tagging information for each of the
50-best parses of each sentence. Each file contains a sequence of n-best lists. An
n-best list is a list of the top n parses of each sentence with the corresponding
parser probability and re-ranker score. Following is an example of a simple n-best
list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next
token is the article id from the North American News Text Corpus ("reute9406_007.0356"),
followed by an underscore, followed by the number of the sentence in the article ("13").
The parses follow; for brevity, only four parses out of the fifty are presented here.
Each parse consists of a reranker score (4.9244 for the first parse) and parser log
probability (-147.337 for the first parse), a new line, and then the parse tree itself.
Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by
decreasing reranker scores. Source material is as follows:

Source                                                 Dates      Approx. # Words (millions)
Los Angeles Times & Washington Post                    1994-1997  52
New York Times                                         1994-1996  173
Reuters (General and Financial)                        1994-1996  85
Wall Street Journal (Not included in General Release)  1994-1996  40
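The Methodology passage in this summary describes estimating rule probabilities from relative frequencies and combining them over a whole parse to pick the most likely one. A minimal Python sketch of that idea follows, using a standard probabilistic context-free grammar formulation: log probabilities are summed, which is equivalent to multiplying the rule probabilities. The rule counts, the two candidate parses, and all names are hypothetical, and this is not a description of how the Charniak and Johnson parser is implemented.

```python
import math
from collections import defaultdict

# Hypothetical rule counts, standing in for counts read off a treebank.
RULE_COUNTS = {
    ("VP", ("VBD", "NP")): 3,
    ("VP", ("VBD", "NP", "PP")): 1,
    ("NP", ("DT", "NN")): 6,
    ("NP", ("NP", "PP")): 2,
    ("PP", ("IN", "NP")): 4,
}


def rule_log_probs(rule_counts):
    """Relative-frequency estimate: P(lhs -> rhs) = count / total count for lhs."""
    totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] += count
    return {
        rule: math.log(count / totals[rule[0]])
        for rule, count in rule_counts.items()
    }


def parse_log_prob(rules_used, log_probs):
    """Score a parse as the sum of the log probabilities of its rules."""
    return sum(log_probs[rule] for rule in rules_used)


LOG_PROBS = rule_log_probs(RULE_COUNTS)

# Two candidate analyses of one ambiguous verb phrase (prepositional
# phrase attached to the noun phrase or to the verb phrase).
CANDIDATES = {
    "noun_attach": [
        ("VP", ("VBD", "NP")),
        ("NP", ("NP", "PP")),
        ("NP", ("DT", "NN")),
        ("PP", ("IN", "NP")),
        ("NP", ("DT", "NN")),
    ],
    "verb_attach": [
        ("VP", ("VBD", "NP", "PP")),
        ("NP", ("DT", "NN")),
        ("PP", ("IN", "NP")),
        ("NP", ("DT", "NN")),
    ],
}

SCORES = {name: parse_log_prob(rules, LOG_PROBS) for name, rules in CANDIDATES.items()}
BEST = max(SCORES, key=SCORES.get)  # candidate with the highest log probability
```

With these toy counts the two analyses receive different scores, so the ambiguity is resolved by choosing the higher-scoring candidate.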
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
McClosky, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charniak, Eugene
ADDED ENTRY--PERSONAL NAME
- Personal name:
Johnson, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634832
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 273-098-424-167-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
North American News Text, Complete
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
North American News Text, Complete, Linguistic Data Consortium (LDC) catalog number
LDC2008T15 and isbn 1-58563-483-2, is a collection of English news text from the Los
Angeles Times, Washington Post, New York Times, Reuters and the Wall Street Journal.
This corpus was originally released in 1995 as the North American News Text Corpus
(LDC95T21) and is reissued to complement the release of the Brown Laboratory for Linguistic
Information Processing (BLLIP) North American News Text sets (LDC2008T13, LDC2008T14),
which consist of Penn Treebank-style parsing of that news text. North American News
Text is reissued in two versions: North American News Text, Complete (LDC2008T15), the
members-only original version, now available as a 2008 Membership Year corpus; and
North American News Text, General Release (LDC2008T16) (which does not include text
from the Wall Street Journal), available to nonmembers for the first time.
The directory structure of each of these publications has been restructured to be
identical to the directory structure of the BLLIP releases. *Data* The table below
contains a breakdown of the sources, epochs and word counts for the data in the North
American News Text releases:

Source                                        Dates                       # Words (millions)
Los Angeles Times & Washington Post           May 1994 - August 1997      52
New York Times News & Syndicate               July 1994 - December 1996   173
Reuters News Service (General and Financial)  April 1994 - December 1996  85
Wall Street Journal (not in General Release)  July 1994 - December 1996   40

The New York Times and the Los Angeles Times/Washington Post services include a range
of other newspaper sources in their syndicated newswires. The Los Angeles Times/Washington
Post material in this corpus includes some news text from the following sources:

* Newsday
* The Baltimore Sun
* The Hartford Courant

The New York Times material in this corpus contains some data from the following sources,
although New York Times articles predominate:

* Bloomberg Business News
* The Boston Globe
* Los Angeles Daily News
* Fort Worth Star-Telegram
* Newsweek
* Cox News Service
* The Arizona Republic
* Seattle Post-Intelligencer
* San Francisco Examiner
* Houston Chronicle
* San Francisco Chronicle
* Economist Newspaper Ltd.
* Hearst Newspapers

The text content of each data file (following uncompression with the GNU-unzip utility)
consists of plain ASCII character data with SGML tags to indicate article boundaries
and organization of information within each article. There are differences among the
five primary newswire sources
in terms of the number and types of SGML tags used in the text, but the following
tag structure is common to all data sets:

<DOC>    -- start of a new article
...      -- some variety of "header" tags appears here
<TEXT>   -- start of the text content of the article
<p>      -- all paragraph boundaries are marked by this tag
...      -- text data as it is provided by the newswire service
</TEXT>  -- end of text content of the article
...      -- some variety of "trailer" tags appears here
</DOC>   -- end of article

In general, the differences in format among the various newswire sources will be found
in the SGML tags that appear between <DOC> and <TEXT>, and those that appear between
</TEXT> and </DOC>. The actual text content of articles (the region between <TEXT> and
</TEXT>) is consistent in format across sources, except for some uses of the SGML "&..;"
notation to represent special characters in the data. For example, "&MD;" is used
in the "latwp" material to represent the "em-dash" character, which is typically used
to separate the "dateline" from the opening sentence in the first paragraph of each
article. There may also be differences in how quotation marks are rendered. As this
re-release is intended to complement the BLLIP North American News Text releases,
the directory structure of this corpus is identical to that of the BLLIP publications.
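The article layout described in this summary (an article wrapper, a text region, and a paragraph marker) lends itself to simple extraction. The following Python sketch is one way to pull paragraphs out of a decompressed data file. It assumes the article, text, and paragraph tags are `<DOC>`, `<TEXT>`, and `<p>`; those tag names, the `<DOCID>` header tag in the sample, and the function name `iter_articles` are assumptions, not taken from this record.

```python
import re

# Assumed tag names; header and trailer tags vary by newswire source
# and are simply skipped here.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_articles(sgml):
    """Yield each article as a list of whitespace-normalized paragraphs."""
    for doc in DOC_RE.finditer(sgml):
        text = TEXT_RE.search(doc.group(1))
        if text is None:
            continue  # article without a text region
        chunks = re.split(r"<p>", text.group(1))
        paragraphs = [" ".join(chunk.split()) for chunk in chunks]
        yield [p for p in paragraphs if p]


# Hypothetical sample article in the assumed layout.
SAMPLE = """\
<DOC>
<DOCID> sample.0001 </DOCID>
<TEXT>
<p>
First paragraph of
the article.
<p>
Second paragraph.
</TEXT>
</DOC>
"""

ARTICLES = list(iter_articles(SAMPLE))
```

A regular expression is enough here because the paragraph marker has no closing tag, which a strict XML parser would reject.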
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634840
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 637-707-612-417-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
North American News Text, General Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
North American News Text, General Release, Linguistic Data Consortium (LDC) catalog
number LDC2008T16 and isbn 1-58563-484-0, is a collection of English news text from
the Los Angeles Times, Washington Post, New York Times and Reuters. This data is a
subset of the data contained in the North American News Text Corpus (LDC95T21) released
in 1995 and is reissued to complement the release of the Brown Laboratory for Linguistic
Information Processing (BLLIP) North American News Text sets (LDC2008T13, LDC2008T14),
which consist of Penn Treebank-style parsing of the North American News Text Corpus
text. North American News Text is reissued in two versions: North American News Text,
Complete (LDC2008T15), the members-only original version, now available as a 2008 Membership
Year corpus; and North American News Text, General Release (LDC2008T16) (which does
not include text from the Wall Street Journal), available to nonmembers for the first
time. The directory structure of each of these publications has been restructured
to be identical to the directory structure of the BLLIP releases. *Data* The table
below contains a breakdown of the sources, epochs and word counts for the data in
the North American News Text releases:

Source                                        Dates                       # Words (millions)
Los Angeles Times & Washington Post           May 1994 - August 1997      52
New York Times News & Syndicate               July 1994 - December 1996   173
Reuters News Service (General and Financial)  April 1994 - December 1996  85
Wall Street Journal (not in General Release)  July 1994 - December 1996   40

The New York Times and the Los Angeles Times/Washington Post services include a range
of other newspaper sources in their syndicated newswires. The Los Angeles Times/Washington
Post material in this corpus includes some news text from the following sources:

* Newsday
* The Baltimore Sun
* The Hartford Courant

The New York Times material in this corpus contains some data from the following sources,
although New York Times articles predominate:

* Bloomberg Business News
* The Boston Globe
* Los Angeles Daily News
* Fort Worth Star-Telegram
* Newsweek
* Cox News Service
* The Arizona Republic
* Seattle Post-Intelligencer
* San Francisco Examiner
* Houston Chronicle
* San Francisco Chronicle
* Economist Newspaper Ltd.
* Hearst Newspapers

The text content of each data file (following uncompression with the GNU-unzip utility)
consists of plain ASCII character data with SGML tags to indicate article boundaries
and organization of information within each article. There are differences among the
five primary newswire sources
in terms of the number and types of SGML tags used in the text, but the following
tag structure is common to all data sets:

<DOC>    -- start of a new article
...      -- some variety of "header" tags appears here
<TEXT>   -- start of the text content of the article
<p>      -- all paragraph boundaries are marked by this tag
...      -- text data as it is provided by the newswire service
</TEXT>  -- end of text content of the article
...      -- some variety of "trailer" tags appears here
</DOC>   -- end of article

In general, the differences in format among the various newswire sources will be found
in the SGML tags that appear between <DOC> and <TEXT>, and those that appear between
</TEXT> and </DOC>. The actual text content of articles (the region between <TEXT> and
</TEXT>) is consistent in format across sources, except for some uses of the SGML "&..;"
notation to represent special characters in the data. For example, "&MD;" is used
in the "latwp" material to represent the "em-dash" character, which is typically used
to separate the "dateline" from the opening sentence in the first paragraph of each
article. There may also be differences in how quotation marks are rendered. As this
re-release is intended to complement the BLLIP North American News Text releases,
the directory structure of this corpus is identical to that of the BLLIP publications.
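The summary notes that some sources use SGML "&..;" notation for special characters, with "&MD;" standing for the em-dash in the "latwp" material. A small Python sketch of how such codes might be replaced is shown here. Only the "&MD;" mapping comes from the summary; the function name `decode_entities`, the table name `ENTITY_MAP`, and the choice to leave unknown codes untouched are assumptions.

```python
import re

# Only "&MD;" is documented in the summary; extend this table from the
# corpus documentation before relying on it for other codes.
ENTITY_MAP = {"MD": "\u2014"}  # U+2014 is the em-dash character

ENTITY_RE = re.compile(r"&([A-Za-z0-9]+);")


def decode_entities(text, entity_map=ENTITY_MAP):
    """Replace known "&NAME;" codes; leave unrecognized ones unchanged."""
    return ENTITY_RE.sub(
        lambda match: entity_map.get(match.group(1), match.group(0)),
        text,
    )


# Hypothetical dateline in the style described for the "latwp" material.
DECODED = decode_entities("WASHINGTON &MD; The Senate voted on Tuesday.")
```

Leaving unknown codes in place keeps the conversion lossless, so additional mappings can be applied later without re-reading the source files.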
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634859
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 741-988-462-570-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLHOME Mandarin Chinese Transcripts - XML version
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium (LDC)
catalog number LDC2008T17 and isbn 1-58563-485-9, was developed at Lancaster University,
United Kingdom. LDC's CALLHOME Mandarin Chinese collection includes telephone speech,
associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of
120 unscripted telephone conversations between native speakers of Mandarin Chinese.
All calls, which lasted up to thirty minutes, originated in North America and were
placed to locations overseas; most participants called family members or close friends.
CALLHOME Mandarin Chinese Transcripts covers a contiguous five or ten-minute segment
from each of the telephone speech files. The transcripts are in tab-delimited format
with GB2312 encoding, are timestamped by speaker turn for alignment with the speech
signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon
comprises over 40,000 words from twenty CALLHOME Mandarin transcripts. CALLHOME
Mandarin Chinese Transcripts - XML Version, the latest addition to this collection,
presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding,
retokenization and part-of-speech (POS) tagging. The retokenization and POS information
were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the
Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims
to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown-word
recognition into a single theoretical framework using multi-layered hierarchical
hidden Markov models. In addition to the original applications for Mandarin Chinese
CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts -
XML Version will be useful in the grammatical study of spoken Mandarin. *Data* This
XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features
and proper nouns) from the original transcripts release, but the mnemonics used in
the original release were migrated into XML markup following the mapping rules described
below: All analyses in the original release were retained at the sacrifice of tokenization
and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features
may split a word, which can affect the tagging accuracy). However, the results of
the automated processing were substantially post-edited. For example, four aspect
markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand;
all of the classifiers (also called "measure words") were re-tagged using a more fine-grained
annotation scheme developed on the Lancaster University project. In addition, a large
number of obvious typographical errors in the original release were corrected in the
process of post-editing.

Number of unique words: 6,895
Total number of words: 300,767
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
McEnery, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xiao, Richard
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634891
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 579-080-587-850-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 contains transcripts and
English translations of 19.1 hours of Chinese broadcast news programming from Voice
of America (VOA), China Central TV (CCTV) and Phoenix TV. GALE Phase 1 Chinese Broadcast
News Parallel Text - Part 3 is the last of the three-part GALE Phase 1 Chinese Broadcast
News Parallel Text, which, along with other corpora, was used as training data in
year 1 (Phase 1) of the DARPA-funded GALE program. LDC has previously released GALE
Phase 1 Chinese Broadcast News Parallel Text - Part 1 and GALE Phase 1 Chinese Broadcast
News Parallel Text - Part 2. *Source Data* A total of 19.1 hours of Chinese broadcast
news recordings were selected from three sources: VOA, CCTV (a broadcaster from Mainland
China) and Phoenix TV (a Hong Kong-based satellite TV station). The transcripts and
translations represent recordings of five different programs. A manual selection procedure
was used to choose data appropriate for the GALE program, namely, news programs focusing
on current events. Stories on topics such as sports, entertainment and business were
excluded from the data set. The following table is a summary of the files included
in this release.

Source      Program             Epoch (YYYY.MM)            #hours  #characters
VOA         VOA Mandarin        1997.05.20 and 1997.06.18  2.0     25,051
CCTV        CCTV4 Daily News    2005.11-2006.01            4.8     57,376
CCTV        CCTV4 News3         2005.05-2006.01            2.6     33,361
Phoenix TV  Global Report       2005.10-2005.12            5.9     61,339
Phoenix TV  Good Morning China  2005.12-2006.01            3.8     48,976

The VOA files are named mv970520c.tdf
and mv970618a.tdf. The corresponding speech recordings for the VOA data can be found
in 1997 Mandarin Broadcast News Speech (HUB4-NE), LDC98S73. *Transcription* The selected
audio snippets were carefully transcribed by LDC annotators and professional transcription
agencies following LDC's Quick Rich Transcription specification. Manual sentence units/segments
(SU) annotation was also performed as part of the transcription task. Three types
of end of sentence SU are identified: (1) statement SU, (2) question SU and (3) incomplete
SU. *Translation* After transcription and SU annotation, files were reformatted into
a human-readable translation format and assigned to professional translators for careful
translation. Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as names and speech
disfluencies) and quality control procedures applied to completed translations. *TDF
Format* All final data are in Tab Delimited Format (TDF). TDF is compatible with other
transcription formats, such as the Transcriber format and AG format, and it is easy
to process. Each line of a TDF file corresponds to a speech segment and contains 13
tab-delimited fields, listed here as field name followed by data type where one is given:
file; channel (int); start (float); end (float); speaker; speakerType; speakerDialect;
transcript; section (int); turn (int); segment (int); sectionType; and suType. A source
TDF file and its translation are the same except
that the transcript in the source TDF is replaced by its English translation. *Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not
necessarily reflect the position or the policy of the Government, and no official
endorsement should be inferred.
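The TDF layout described above can be read with a few lines of Python. The following is a minimal sketch, not LDC tooling: it assumes every non-blank line holds exactly the 13 fields in the order listed, that the files contain no header or comment lines, and that a source file and its translation list their segments in the same order. The sample rows and the names parse_tdf and pair_translations are hypothetical.

```python
# Field names in the order given in the description; only the int/float
# types are stated there, all other fields are kept as strings.
TDF_FIELDS = [
    "file", "channel", "start", "end", "speaker", "speakerType",
    "speakerDialect", "transcript", "section", "turn", "segment",
    "sectionType", "suType",
]
INT_FIELDS = {"channel", "section", "turn", "segment"}
FLOAT_FIELDS = {"start", "end"}


def parse_tdf(text):
    """Parse TDF text into a list of dicts, one per speech segment."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(TDF_FIELDS):
            raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(values)}")
        segment = dict(zip(TDF_FIELDS, values))
        for name in INT_FIELDS:
            segment[name] = int(segment[name])
        for name in FLOAT_FIELDS:
            segment[name] = float(segment[name])
        segments.append(segment)
    return segments


def pair_translations(source_segments, translated_segments):
    """Pair source transcripts with translations, segment by segment."""
    pairs = []
    for src, tgt in zip(source_segments, translated_segments):
        # Assumption: aligned segments share the same time span.
        if (src["start"], src["end"]) != (tgt["start"], tgt["end"]):
            raise ValueError("source and translation segments are not aligned")
        pairs.append((src["transcript"], tgt["transcript"]))
    return pairs


# Hypothetical sample rows, invented for illustration only.
_common = ["example.tdf", "0", "12.5", "15.25", "speaker1", "male", "native"]
_tail = ["1", "2", "3", "report", "statement"]
SAMPLE_SOURCE = "\t".join(_common + ["你好"] + _tail) + "\n"
SAMPLE_TRANSLATION = "\t".join(_common + ["Hello"] + _tail) + "\n"
```

Because the two files differ only in the transcript field, pairing by position is enough for this sketch; a stricter check could compare every other field as well.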
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634883
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 707-184-716-094-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: ISOLET Spoken Letter Database Version 1.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: ISOLET Spoken Letter Database Version 1.3, Linguistic Data Consortium (LDC)
catalog number LDC2008S07 and ISBN 1-58563-488-3, was created by the Center for Spoken
Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health
and Science University, Beaverton, Oregon. CSLU: ISOLET Spoken Letter Database Version
1.3 is a database of letters of the English alphabet spoken in isolation under quiet
laboratory conditions and associated transcripts. The data was collected in 1990 and
consists of two productions of each letter by 150 speakers (7800 spoken letters) for
approximately 1.25 hours of speech. The subjects were recruited through advertising
and consisted of 75 male speakers and 75 female speakers. Each subject received a
free dessert at a local restaurant in exchange for his or her participation in the
data collection. All speakers reported English as their native language. Their ages
varied from 14 to 72 years; the speakers' average age was 35 years. *Data* Speech
was recorded in the OGI speech recognition laboratory. The room measured 15' by 15'
with a tile floor, standard office wall board and drop ceiling and contained two Sun
workstations and three disk drives. The recording equipment was selected to mimic
the equipment used to collect the TIMIT database as closely as possible. The speech
was recorded with a Sennheiser HMD 224 noise-canceling microphone, low pass filtered
at 7.6 kHz. Data capture was performed using the AT&T DSP32 board installed in a Sun
4/110. The data were sampled at 16 kHz and converted to RIFF (.WAV) format. The subjects
were seated in front of a Sun workstation and prompted with letters in random order.
After each prompt, the subject would strike the return key and say the letter. Two
seconds of speech were recorded and immediately played back for verification. If the
subject spoke too soon or too late and missed the two-second buffer, or if the experimenter
or subject decided that the letter was misspoken, the recording was repeated. There
was no attempt to elicit ideal speech. A letter was judged to be misspoken only if
there was a significant departure from normal pronunciation. After the recording session,
each utterance was verified by a human examiner for two determinations. First, the
examiner viewed a waveform of the utterance to determine that the speech was padded
with silence. The examiner then listened to the speech and noted any ambiguous or
misspoken utterances. All utterances noted by the examiner were examined by two additional
human examiners. If a majority of the examiners perceived that an utterance was abnormal,
that utterance, and the rest of the utterances from that speaker, were removed from
the corpus. The transcriptions of the recorded speech are time-aligned phonetic transcriptions
conforming to the CSLU Labeling standards. Time-aligned word transcriptions are represented
in a standard orthography or romanization. Speech and non-speech phenomena are distinguished.
The transcriptions are aligned to a waveform by placing boundaries to mark the beginning
and ending of words. In addition to the specification of boundaries, this level of
transcription includes additional commentary on salient speech and non-speech characteristics,
such as glottalization, inhalation, and exhalation.
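The recording parameters given above (16 kHz sampling, two-second recordings, RIFF/WAV files) can be checked with Python's standard wave module. This is a minimal sketch under stated assumptions: the corpus file layout is not described here, so the example builds a silent in-memory stand-in instead of opening a real file, and the mono 16-bit sample format is an assumption. The function names are hypothetical.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000  # sampling rate stated in the description
EXPECTED_SECONDS = 2.0    # recording length stated in the description


def describe_wav(fileobj):
    """Return (sample_rate_hz, duration_seconds) for a RIFF/WAV stream."""
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


def make_silent_wav(seconds, rate=EXPECTED_RATE_HZ):
    """Build an in-memory WAV of silence as a stand-in for a corpus file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


rate, duration = describe_wav(make_silent_wav(EXPECTED_SECONDS))
```

With a real corpus file, describe_wav would be called on the opened file instead of the synthetic buffer.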
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Phonology
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Alphabet
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Pronunciation
- General subdivision:
Native speakers
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muthusamy, Y.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fanty, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634867
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 429-488-225-160-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The New York Times Annotated Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The New York Times Annotated Corpus contains over 1.8 million articles written and
published by the New York Times between January 1, 1987 and June 19, 2007 with article
metadata provided by the New York Times Newsroom, the New York Times Indexing Service
and the online production staff at nytimes.com. The corpus includes:
* Over 1.8 million articles (excluding wire services articles that appeared during
the covered period).
* Over 650,000 article summaries written by library scientists.
* Over 1,500,000 articles manually tagged by library scientists with tags drawn from
a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
* Over 275,000 algorithmically-tagged articles that have been hand verified by the
online production staff at nytimes.com.
* Java tools for parsing corpus documents from .xml into a memory resident object.
As part of the New York Times' indexing procedures, most articles are manually summarized
and tagged by a staff of library scientists. This collection contains over 650,000
article-summary pairs which may prove to be useful in the development and evaluation
of algorithms for automated document summarization. Also, over 1.5 million documents
have at least one tag. Articles are tagged for persons, places, organizations, titles
and topics using a controlled vocabulary that is applied consistently across articles.
For instance, if one article mentions "Bill Clinton" and another refers to "President
William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL". The
New York Times has established a community website for researchers working on the
data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion
about the corpus. *Data* The text in this corpus is formatted in News Industry Text
Format (NITF) developed by the International Press Telecommunications Council, an
independent association of news agencies and publishers. NITF is an XML specification
that provides a standardized representation for the content and structure of discrete
news articles. NITF encompasses structural markup such as bylines, headlines and paragraphs.
The format also provides management attributes for categorizing articles into topics,
summarization usage restrictions and revision histories. The goals of NITF are to
answer the essential questions inherent in news articles: Who, What, When, Where and
Why.
* Who: Who owns the copyright, who has rights to republish the article and who the
article is about.
* What: The subjects reported, the named entities inside the article and the events
it describes.
* When: When the article was written, when it was issued and when it was revised.
* Where: Where the article was written, where the events took place and where it was
delivered.
* Why: The metadata describing the newsworthiness of the article.
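Because NITF is XML, the structural markup mentioned above (headlines, bylines, paragraphs) can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a hypothetical, heavily simplified NITF-style document. The sample text and the function name extract_article are invented for illustration, and real corpus files may carry more elements and attributes than are shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified NITF-style document, invented for illustration.
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <block>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def extract_article(xml_text):
    """Pull the headline, byline and paragraph texts out of an NITF string."""
    root = ET.fromstring(xml_text)
    headline = root.findtext(".//hl1")
    byline = root.findtext(".//byline")
    # Collect every paragraph element in document order.
    paragraphs = [p.text or "" for p in root.iter("p")]
    return {"headline": headline, "byline": byline, "paragraphs": paragraphs}


article = extract_article(SAMPLE_NITF)
```

The Java tools shipped with the corpus serve the same purpose; this sketch only shows the general shape of the task.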
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sandhaus, Evan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634905
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 206-787-441-605-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
PennBioIE Oncology 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The PennBioIE Oncology Corpus consists of 1414 PubMed abstracts on cancer, concentrating
on molecular genetics, and comprising approximately 327,000 words of biomedical text, tokenized
and annotated for paragraph, sentence, part of speech, and 24 types of biomedical
named entities in five categories of interest. 318 of the abstracts have also been
syntactically annotated. All of the annotation was based on Penn Treebank II standards,
with some modifications for special characteristics of the biomedical text. The entity
definitions were developed and revised in an extensive process of interaction between
domain experts and biomedically trained annotators. The oncology data comprises two
subcorpora:
* The Sanger subcorpus (san) consists of abstracts of 577 articles previously
annotated by the Sanger Institute for global mention of oncological named entities.
These annotations were metadata reflecting the presence or absence of such mentions
anywhere in the text, without reference to specific strings. The articles concentrate
on variations in a small set of human genes associated with many different types of
cancer; they were not part of ongoing work at Sanger, and the annotations were never
published. We did not refer to the Sanger annotations after selection of the abstracts.
* The neuroblastoma subcorpus (nb) consists of 837 abstracts of articles dealing with
this particular type of cancer selected by colleagues at Children's Hospital of Philadelphia.
They do not all concentrate on genetics, but they mention a much larger number of
genes than the Sanger files do.
The data was prepared by the Linguistic Data Consortium
for the Institute for Research in Cognitive Science, with funding from the National
Science Foundation under Grant No. ITR EIA-0205448, Information Technology Research
(ITR) program, in collaboration with Dr. Peter White's group in Pediatric Oncology
at the Children's Hospital of Philadelphia. *Data Description* The corpus contains
1412 PubMed abstracts comprising approximately 381,000 total words of text. Each file
has been tokenized and its biomedical portions (327,000 words) exhaustively annotated
for paragraph, sentence, and part of speech, and non-exhaustively annotated for 16
("Level 1") or 23 ("Level 2") types of named entity. Each token has a part-of-speech
tag. Tokens and POS tags: Tokens in biomedical and chemical notation and terms, and
spelled-out numbers, may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+
+ K+)ATPase", "two hundred seven"); and named entity mentions may comprise several
tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span
sentence boundaries. Biomedical and non-biomedical text: The title and body of each
abstract are considered to be biomedical text, and the automatic and manual annotations
in them have been extensively curated. Everything else, such as citation information
and author names, is considered non-biomedical; this has not been entity annotated,
and its automated tokenization and part of speech tags have not been curated and are
known to be unreliable. In non-biomedical text, the tag "section" is used instead
of "sentence", allowing users to include or exclude these parts. There are approximately
274,000 words of biomedical text and 54,000 words of non-biomedical text. (Because
of a problem with software maintenance, about 24,000 tokens in biomedical text, mostly
in the nb2 subcorpus, are missing POS tags.)
Domains: The abstracts are divided across two domains:
* the molecular genetics of cancer, from a list selected by the Cancer Genome Project
of the Sanger Institute (v0.9: 588 files; v1.0: 577 files)
* neuroblastoma, a type of cancer that develops from nerve tissue in infants and children
(v0.9: 569 files; v1.0: 837 files = 392 from v0.9 + 445 new)
The difference between the domains is apparent in the ratio of distinct mentions (types)
of tumor types and of genes, after normalization: 3.5 times as many tumor types in
the Sanger files, but 5.8 times as many genes in the neuroblastoma files.
Other divisions of the corpus: The files are further subdivided by annotation level
into three subcorpora, each with its own subdirectory on this CD and its own set of
metadata files.
* nb1: neuroblastoma annotated to level 1 (407 files)
* nb2: neuroblastoma annotated to level 2 (430 files)
* san: Sanger annotated to level 2 (all 577 files)
Metadata is also provided for
* onco: the entire v1.0 oncology corpus (1414 files)
* nb: nb1 + nb2, all the neuroblastoma data regardless of annotation level (837 files)
* o2: nb2 + san, all the level 2 data regardless of subcorpus (1007 files)
Version 0.9 is included in this release in a separate directory. It is similarly organized,
though with only one level of annotation, less detailed than v1.0's level 1:
* onco09: all the v0.9 oncology corpus (1157 files)
* nb09: neuroblastoma (569 files)
* san09: Sanger (588 files)
A subset of the v0.9 data was also syntactically annotated (treebanked):
* onco09t: (318 files)
* nb09t: (115 files)
* san09t: (203 files)
*Principles and Methods* Many annotation projects
start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus,
which is treated as unchangeable. As a result, annotation practices have sometimes
involved compromises which might not have been necessary if the earlier annotation
had been able to integrate the requirements of the later work. Such integration is
necessary here because of the scope of this project, involving highly technical biomedical
texts, entity definitions driven by the needs of biomedical research, and the goal
of making the annotation layers work together as much as possible, e.g., using entity
information in the treebank annotation of prenominal modifiers. Such integration is
also possible given the relatively long term of the grant (five years) and because
researchers were starting with fresh text, applying all layers of annotation themselves.
The texts are annotated at the following layers:
* Paragraph
* Sentence
* Biomedical entity
* Token and part of speech
* Syntax (treebanking) (some texts only)
* Semantic relations (some oncology texts only)
Paragraph, sentence, tokenization, POS, and syntactic
annotation (treebanking) are applied by automatic taggers and manually corrected;
entity annotation is manual. The authors originally used a POS tagger trained on Penn
Treebank data, which made many errors on the very different text of these biomedical
abstracts. When there was enough manually-corrected data to train a tagger, overall
accuracy rose from 88.53% to 97.33% (Kulick et al. 2004 (slides)). Annotation at all
layers except entity is based on the Penn Treebank II guidelines, with a number of
modifications that have been found necessary, many of which were subsequently adopted
by the Penn Treebank. Entity definitions came originally from domain experts and were
developed and refined in dialogue with the annotators.
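The v1.0 subcorpus groupings described above are plain unions of the three base subcorpora, so their file counts can be cross-checked with simple arithmetic. The sketch below takes the base counts from the description. Treating onco as nb1 + nb2 + san is an inference from "the entire v1.0 oncology corpus" and is not stated explicitly, and the names SUBCORPORA, GROUPS and group_sizes are hypothetical.

```python
# File counts for the three base subcorpora, as stated in the description.
SUBCORPORA = {"nb1": 407, "nb2": 430, "san": 577}

# Composite groupings expressed as unions of base subcorpora.
GROUPS = {
    "onco": ("nb1", "nb2", "san"),  # inferred: the entire v1.0 corpus
    "nb": ("nb1", "nb2"),           # stated: nb1 + nb2
    "o2": ("nb2", "san"),           # stated: nb2 + san
}


def group_sizes(subcorpora, groups):
    """Sum the base file counts for each composite grouping."""
    return {
        name: sum(subcorpora[part] for part in parts)
        for name, parts in groups.items()
    }


sizes = group_sizes(SUBCORPORA, GROUPS)
# sizes == {"onco": 1414, "nb": 837, "o2": 1007}, matching the stated totals
```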
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Oncology
- Form subdivision:
Terminology
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Oncology
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Oncology
- Form subdivision:
Databases.
- General subdivision:
Abstracting and indexing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mandel, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
White, Peter
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634921
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 652-357-402-514-3
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NomBank is an annotation project at New York University that provides argument structure
for instances of common nouns in the Penn Treebank (Treebank-2 and Treebank-3). NomBank
1.0 was released by the NomBank project in December 2007. It covers all of the "markable"
nouns in the Penn Treebank Wall Street Journal data. Specifically, it includes 114,576
propositions that were derived from looking at a total of 202,965 noun instances and
choosing only those nouns whose arguments occur in the text. The work of NomBank is
related to the PropBank project at the University of Colorado. NomBank marks the sets
of arguments that cooccur with nouns in Proposition Bank I, just as PropBank records
such information for verbs. Related resources and further information about NomBank
are available from the NomBank project website. *Data* NomBank v. 1.0 is a human-readable
version of NomBank 1.0. It contains data with licenses that are owned or managed by
the Linguistic Data Consortium. A license to either Treebank-2 or Treebank-3 is required
in order to obtain NomBank v. 1.0.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reeves, Ruth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Macleod, Catherine
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 419-167-670-549-0
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
COMNOM is an automatically enriched version of COMLEX Syntax that was created at New
York University as part of the NomBank annotation project. COMLEX resources are distributed
by the Linguistic Data Consortium (LDC) and consist of the following: COMLEX English
Syntax Lexicon (LDC98L21), an English dictionary consisting of approximately 38,000
lemmas with detailed information about the syntactic characteristics of each lexical
item and subcategorization (complement structures); and COMLEX Syntax Text Corpus
Version 2.0 (LDC96T11). COMNOM adds classes to COMLEX Syntax lexical entries using
NOMLEX-PLUS, a dictionary with approximately 8,000 entries. COMNOM collected prepositions
from NOMLEX-PLUS sub-categorizations (:VERB-SUBC, :OBJECT, :SUBJECT, etc.), deduced
essential complements from them and added them to the existing COMLEX entry. Further
information about the methodology used in COMNOM can be found in Meyers, "Those Other
NomBank Dictionaries -- Manual for Dictionaries that Come with NomBank". Related resources
and further information about COMNOM and NomBank are available from the NomBank project
website. A license to COMLEX English Syntax Lexicon (LDC98L21) or COMLEX Syntax Text
Corpus Version 2.0 (LDC96T11) is required in order to obtain COMNOM v. 1.0. *Data*
This release includes three versions of COMNOM which correspond to the three versions
of NOMLEX-PLUS and are characterized by the amount of corpus training that influenced
their creation. The data used for training are the Wall Street Journal materials in
the Penn Treebanks (Treebank-2 and Treebank-3), with annotations from Proposition
Bank I and NomBank 1.0. The three versions are: * COMNOM-clean.1.0 -- contains no
information derived from annotated data * COMNOM.1.0 -- contains information from
the entire annotated corpus * COMNOM-training.1.0 -- contains information from annotated
data in sections 02-21 of the corpus only.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Reeves, Ruth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Macleod, Catherine
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634948
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 541-800-982-573-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
AQUAINT-2 Information-Retrieval Text Research Collection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium
(LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for the
National Institute of Standards and Technology (NIST) AQUAINT 2007 Question-Answer (QA)
track. It consists of approximately 2.5 GB of English news text from six distinct
sources collected by LDC (Agence France Presse, Associated Press, Central News Agency
(Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency)
covering the period from October 2004 through March 2006. The AQUAINT-2 collection
is the second part of a series intended to provide data useful for developing, evaluating
and testing information extraction and retrieval systems. It follows the publication
of The AQUAINT Corpus of English News Text (LDC2002T31). The AQUAINT (Advanced Question-Answering
for Intelligence) program addresses interactivity with scenarios or tasks. The scenario
provides a context in which questions will be asked and answered, and the task reflects
the overall assignment. The program is committed to solving a single problem: how to
find topically relevant, semantically related, timely information in massive amounts
of data in diverse languages, formats, and genres. AQUAINT technology is advancing
the development of components and functions that allow users to pose a series of
intertwined, complex questions and obtain comprehensive answers in the context of
broad information-gathering tasks. In addition, while most information retrieval systems
present only links to documents, AQUAINT is producing technology that will present
answers to the user's questions. This question-answering technology is being developed
with features for managing semantic similarity, co-reference, event characterization,
opinions, linguistic and social and world inferencing, redundancy, deception, and
missing or contradictory information. In order to allow the analyst to guide the exploration
in concert with the machine, AQUAINT technology employs interactive question-answering,
the automatic suggestion of additional paths of exploration, and the inferencing of
the social context of the information search. *Data* AQUAINT-2 Information-Retrieval
Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07).
The collection comprises approximately 2.5 GB of text (about 907K documents) spanning
the time period October 2004 - March 2006. For each source, all of the usable data
collected by LDC was processed into a consistent XML format in which the stories for
a given month are concatenated in chronological order into a single "DOCSTREAM" element;
each story is a single "DOC" element within that stream and has a globally unique
"id" attribute. The collection consists of newswire data in English drawn from six
distinct sources, listed below by file name designation and full name:
* afp_eng -- Agence France Presse
* apw_eng -- Associated Press
* cna_eng -- Central News Agency (Taiwan) English Service
* ltw_eng -- Los Angeles Times - Washington Post News Service
* nyt_eng -- New York Times
* xin_eng -- Xinhua News Agency (Beijing) English Service
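The DOCSTREAM/DOC layout described above can be read with a standard XML parser. The following is a minimal sketch, not LDC's own tooling: the sample document, the "id" values, the TEXT child element and the file name afp_eng_200410.xml are all hypothetical, chosen only to illustrate one "DOCSTREAM" element holding several "DOC" elements with unique "id" attributes.

```python
import xml.etree.ElementTree as ET

# File name designations and full names, as listed in the record above.
SOURCES = {
    "afp_eng": "Agence France Presse",
    "apw_eng": "Associated Press",
    "cna_eng": "Central News Agency (Taiwan) English Service",
    "ltw_eng": "Los Angeles Times - Washington Post News Service",
    "nyt_eng": "New York Times",
    "xin_eng": "Xinhua News Agency (Beijing) English Service",
}

# Hypothetical miniature stream; the id values and the TEXT element are assumptions.
SAMPLE = """<DOCSTREAM>
<DOC id="AFP_ENG_20041001.0001"><TEXT>First story.</TEXT></DOC>
<DOC id="AFP_ENG_20041001.0002"><TEXT>Second story.</TEXT></DOC>
</DOCSTREAM>"""


def doc_ids(xml_text):
    """Return the id attribute of every DOC element, in document order."""
    root = ET.fromstring(xml_text)
    return [doc.get("id") for doc in root.iter("DOC")]


def source_name(file_name):
    """Map a file name that starts with a source designation to its full name."""
    return SOURCES[file_name[:7].lower()]


print(doc_ids(SAMPLE))
print(source_name("afp_eng_200410.xml"))
```

For month-sized files, ET.iterparse would avoid holding a whole stream in memory; the sketch parses a string only to stay self-contained.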
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Written English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Voorhees, Ellen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634956
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 857-539-187-188-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LDC Spoken Language Sampler
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes
a wide and growing assortment of resources for researchers, engineers and educators
whose work is concerned with human languages. Historically, most linguistic resources
were not generally available to interested researchers but were restricted to single
laboratories or to a limited number of users. Inspired by the success of selected
readily available and well-known data sets, such as the Brown University text corpus,
LDC was founded in 1992 to provide a new mechanism for large-scale corpus development
and sharing of resources. As of 2008, LDC is a growing consortium of more than 100
companies, universities, and government members, and it has distributed over
50,000 corpora to a global audience. With the support of its members, LDC is able
to provide critical services to the language research community. These services include:
maintaining the data archives, producing and distributing data via media (DVD-ROM
or CD-ROM) or web downloads, negotiating intellectual property agreements with potential
information providers and would-be members, and maintaining relations with other like-minded
groups around the world. Resources available from LDC (http://www.ldc.upenn.edu) include
speech, text and video data and lexicons in multiple languages, as well as software
tools to facilitate the use of corpus materials. *Data* The LDC Spoken Language Sampler
provides a variety of speech, transcript and lexicon samples and is designed to illustrate
the variety and breadth of the resources available from the LDC Publication Catalog. *
most excerpts are truncated to be much shorter than the original files, typically
one minute and thirty seconds of speech * signal amplitude has been adjusted where
necessary to normalize playback volume * some corpora are published in compressed
form, but all samples here are uncompressed * LDC typically uses NIST SPHERE file
format for audio data, but the audio files in this sampler have been converted to
MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
The sampler includes samples from the following corpora and lexicons. Audio samples
range from 30 seconds to 90 seconds and are accompanied by transcripts. An English
Dictionary of the Tamil Verb This dictionary contains translations for over 6000 English
verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil
equivalent in transliteration and Tamil script and audio examples in Spoken Tamil
pronunciation. CALLFRIEND Farsi A corpus of 60 unscripted telephone calls between
friends and acquaintances speaking in their native language, Farsi. CALLFRIEND Tamil
A corpus of 60 unscripted telephone calls between friends and acquaintances speaking
in their native language, Tamil. CALLHOME Japanese A corpus of 120 unscripted telephone
conversations between native Japanese speakers and a corpus of associated transcripts.
CALLHOME Spanish A corpus of 120 unscripted telephone conversations between native
Spanish speakers and a corpus of associated transcripts. CSLU Kids Speech Developed
at Oregon State University's Center for Spoken Language Understanding, this corpus
is a collection of spontaneous and prompted speech from 1100 children from Kindergarten
through Grade 10. Fisher Levantine Arabic A collection of 279 Levantine Arabic telephone
conversations and transcripts from speakers of several nationalities. Grassfields
Bantu Fieldwork: Dschang Tone Paradigms Tone paradigms from Yémba (Bamileke Dschang),
a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern
Cameroon. Gulf Arabic Conversational Telephone Speech Contains 975 telephone conversations
from speakers across the Persian Gulf region and their transcriptions. Korean Telephone
Speech Collection of 100 telephone conversations between native Korean speakers and
their transcriptions. Mawukakan Lexicon The first publication of an ongoing project
aiming to build an electronic dictionary of four Mandekan [Eastern Manding languages
of the Mande Group of the Niger-Congo family] languages. Nationwide Speech Project
A database of speech representing current regional accents and dialects of the United
States. NIST Pilot Meeting Speech Collects speech and transcriptions from topical
discussions in meeting settings including complete descriptive metadata and detailed
descriptions of the physical environment in which the discussions took place. West
Point Russian Speech Utterances of sentences in Russian from 1,891 native and non-native
speakers. *How to Obtain* The LDC Spoken Language Sampler may be downloaded freely.
The sampler is a gzipped tar file (74 MB) that most compression utilities will readily
extract.
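As a rough illustration of handling a gzipped tar file such as the sampler, the sketch below lists the WAV members of an archive using Python's standard tarfile module. The archive built here is a tiny in-memory stand-in; its member names are hypothetical and say nothing about the sampler's actual directory layout.

```python
import io
import tarfile


def list_wav_members(fileobj):
    """Return the sorted names of .wav files inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".wav")
        )


def build_demo_archive():
    """Build a small in-memory .tar.gz with hypothetical member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [("sampler/demo.wav", b"RIFF"), ("sampler/demo.txt", b"hello")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


print(list_wav_members(build_demo_archive()))
```

For a file on disk, tarfile.open(path, mode="r:gz") takes the path directly, and extractall() unpacks the contents.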
LANGUAGE NOTE
- Language note:
Content in . Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Castelletto, Anthony
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635014
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 144-817-035-468-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: Numbers Version 1.3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: Numbers Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2009S01
and ISBN 1-58563-501-4, was created by the Center for Spoken Language Understanding
(CSLU) at OGI School of Science and Engineering, Oregon Health and Science University,
Beaverton, Oregon. It is a collection of naturally produced numbers taken from utterances
in various CSLU telephone speech data collections. The corpus consists of approximately
fifteen hours of speech and includes isolated digit strings, continuous digit strings,
and ordinal/cardinal numbers. The numbers have several sources, among them, phone
numbers, numbers from street addresses and zip codes, uttered by 12618 speakers in
a total of 23902 files. In most of CSLU's telephone data collections, callers were
asked for their phone number, birthdate or zip code. Callers would also occasionally
leave numbers in the midst of another utterance. The numbers in those situations were
extracted from the host utterance and added to the corpus. Additional information
about this publication is available from the corpus web page at CSLU. *Data* The
speech data was collected over analog and digital telephone lines. The analog data
was recorded using a Gradient Technologies analog-to-digital conversion box; those
files were recorded as 16-bit, 8 kHz and stored in a linear format. The digital data
was recorded with the CSLU T1 digital data collection system; those files were sampled
at 8 kHz, 8-bit and stored as µ-law files. All of the data in this release has been
linearly encoded in 16-bit RIFF standard file format. Each file includes an orthographic
transcription following the CSLU Labeling guidelines which are included in the documentation
for this publication. Also, many of the utterances have been phonetically labeled.
*Statistics* CSLU: Numbers Version 1.3 consists of approximately fifteen hours
of speech. The following table gives a count of the number of files for each utterance
type:
Type     Number
phone    2970
street   7079
zipcode  7076
other    6771
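The record states that 8-bit µ-law telephone data was re-encoded as 16-bit linear samples. The sketch below shows the standard G.711 µ-law expansion that such a conversion typically relies on; it is an illustration only, and the record does not say which tool CSLU or LDC actually used. The function names are hypothetical.

```python
import struct

BIAS = 0x84  # bias added during G.711 mu-law compression


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code into a signed 16-bit linear sample."""
    u = ~byte & 0xFF                   # mu-law codes are stored bit-inverted
    exponent = (u >> 4) & 0x07         # 3-bit segment number
    mantissa = u & 0x0F                # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if u & 0x80 else magnitude


def decode_ulaw(data):
    """Convert mu-law bytes to little-endian 16-bit linear PCM bytes."""
    samples = [ulaw_to_linear(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)


pcm = decode_ulaw(bytes([0xFF, 0x80, 0x00]))
print(len(pcm))  # two bytes per decoded sample
```

The resulting bytes could be written as a RIFF file with the standard wave module (one channel, sample width 2, frame rate 8000).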
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Durham, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2008 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634972
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2008S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 726-472-023-584-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CHAracterizing INdividual Speakers (CHAINS)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2008]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2008S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CHAINS was created by researchers at University College Dublin and contains recordings
of thirty-six English speakers reading fables and selected sentences in different
speaking styles. The data was obtained in two different sessions with a time separation
of about two months. The goal of the corpus is to provide a range of speaking styles
and voice modifications for speakers sharing the same accent. Other existing corpora,
in particular CSLU Speaker Recognition Version 1.1, TIMIT and the IViE corpus (English
Intonation in the British Isles), served as referents in the selection of material.
This design decision was made to ensure that methods designed and evaluated on the
CHAINS corpus might be directly testable on these other corpora, which were recorded
using quite different dialects and channel characteristics. Additional documentation
about the corpus and its methodology is available at the CHAINS website. *Data* The
data was collected in two recording sessions in a total of six different speaking
styles. The first recording session was carried out in a professional recording studio
in December 2005. Speakers were recorded in a sound-attenuated booth reading text
in the solo, synchronous and retell styles using a Neumann U87 condenser microphone.
Additional tracks using other microphones (near and far-field) were also recorded
and may be made available upon request to the authors. The second recording session
took place from March 2006 to May 2006 in a quiet office environment, using an AKG
C420 headset condenser microphone. Speakers read text in the rsi, whisper and fast
modes. The six different speaking styles were: * solo reading * synchronous reading
* spontaneous speech (retell) * repetitive synchronous imitation (rsi) * whispered
reading * fast speech reading In two of the speaking conditions adopted, speakers
modified their speech in a constrained fashion towards a known target: in the synchronous
condition, the speech of the co-speaker served as a target, while in rsi, there was
an explicit known static target. The presence of a known target which speakers aim
to copy raises the bar in the discovery and design of procedures for automatic speaker
identification, as the target speech provides a potentially highly confusing foil.
The whisper and fast speech conditions are also well-defined speaking styles which
require substantial voice modification by the speaker. Participants were recruited
through University College Dublin and were paid for their participation. No participant
had any known speech or hearing deficit. The speakers were from the United Kingdom,
the eastern part of Ireland (Dublin and adjacent counties) and the United States.
Further information about the speakers, their gender and dialect is available in the
documentation released with this corpus.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cummins, Fred
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimaldi, Marco
ADDED ENTRY--PERSONAL NAME
- Personal name:
Leonard, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Simko, Juraj
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2008S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 131-245-215-805-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
abv
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 was prepared by the Linguistic
Data Consortium (LDC) and contains a total of 145,000 words (263 files) of Arabic
newsgroup text and its translation selected from thirty-five sources. Newsgroups consist
of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar
forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded
GALE program. This is the second of a two-part release. GALE Phase 1 Arabic Newsgroup
Parallel Text - Part 1 was released in early 2009. LDC has released the following GALE
Phase 1 & 2 Arabic Parallel Text data sets: * GALE Phase 1 Arabic Broadcast News Parallel
Text - Part 1 (LDC2007T24) * GALE Phase 1 Arabic Broadcast News Parallel Text - Part
2 (LDC2008T09) * GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02) * GALE Phase
1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03) * GALE Phase 1 Arabic Newsgroup
Parallel Text - Part 2 (LDC2009T09) * GALE Phase 2 Arabic Broadcast Conversation Parallel
Text Part 1 (LDC2012T06) * GALE Phase 2 Arabic Broadcast Conversation Parallel Text
Part 2 (LDC2012T14) * GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17) * GALE
Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18) * GALE Phase 2 Arabic Web
Parallel Text (LDC2013T01) *Source Data* Preparing the source data involved four stages
of work: data scouting, data harvesting, formatting and data selection. Data scouting
involved manually searching the web for suitable newsgroup text. Data scouts were
assigned particular topics and genres along with a production target in order to focus
their web search. Formal annotation guidelines and a customized annotation toolkit
helped data scouts to manage the search process and to track progress. The data scouting
process is described in the GALE task specification. Data scouts logged their decisions
about potential text of interest (sites, threads and posts) to a database. A nightly
process queried the annotation database and harvested all designated URLs. Whenever
possible, the entire site was downloaded, not just the individual thread or post located
by the data scout. Once the text was downloaded, its format was standardized (by running
various scripts) so that the data could be more easily integrated into downstream
annotation processes. Original-format versions of each document were also preserved.
Typically, a new script was required for each new domain name that was identified.
After scripts were run, an optional manual process corrected any remaining formatting
problems. The selected documents were then reviewed for content-suitability using a
semi-automatic process. A statistical approach was used to rank a document's relevance
to a set of already-selected documents labeled as "good." An annotator then reviewed
the list of relevance-ranked documents and selected those which were suitable for
a particular annotation task or for annotation in general. These newly-judged documents
in turn provided additional input for the generation of new ranked lists. Manual sentence
unit/segment (SU) annotation was also performed on a subset of files following LDC's
Quick Rich Transcription guidelines. Three types of end of sentence SU were identified:
* statement SU * question SU * incomplete SU *Translation* After files were selected,
they were reformatted into a human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE translation guidelines,
which describe the makeup of the translation team, the source data format, the translation
data format, best practices for translating certain linguistic features (such as names
and speech disfluencies) and quality control procedures applied to completed translations.
*Final Data* A source file and its translation share the same file name across directories.
TDF Format All final data are presented in Tab Delimited Format (TDF). TDF is compatible
with other transcription formats, such as the Transcriber format and AG format, making
it easy to process. Each line of a TDF file corresponds to a speech segment and contains
13 tab-delimited fields (field number, name, data type): 1 file (unicode); 2 channel (int);
3 start (float); 4 end (float); 5 speaker (unicode); 6 speakerType (unicode); 7 speakerDialect
(unicode); 8 transcript (unicode); 9 section (int); 10 turn (int); 11 segment (int);
12 sectionType (unicode); 13 suType (unicode). A source TDF file and its translation are
the same except that the transcript
in the source TDF is replaced by its English translation. Some fields are inapplicable
to newsgroup text. Those include the channel, start time, end time and speaker dialect
fields. Those fields are either empty or contain placeholder values. Encoding
All data are encoded in UTF-8. *Sponsorship* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
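The TDF layout described in this summary can be read with a few lines of Python. The sketch below is illustrative only: the function name parse_tdf_line and the choice to map empty or non-numeric values to None are assumptions, not part of the LDC documentation, and any header or comment lines in the released files are not handled.

```python
# Field names and data types as listed in the summary above.
TDF_FIELDS = [
    ("file", str), ("channel", int), ("start", float), ("end", float),
    ("speaker", str), ("speakerType", str), ("speakerDialect", str),
    ("transcript", str), ("section", int), ("turn", int), ("segment", int),
    ("sectionType", str), ("suType", str),
]


def parse_tdf_line(line):
    """Split one TDF line into a dict keyed by field name (hypothetical helper).

    Numeric fields that are empty or not parseable become None, because the
    summary notes that some fields are empty or hold placeholders for
    newsgroup text.
    """
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(parts)}")
    row = {}
    for (name, typ), raw in zip(TDF_FIELDS, parts):
        if typ is str:
            row[name] = raw
        else:
            try:
                row[name] = typ(raw)
            except ValueError:
                row[name] = None
    return row
```

Keeping the field list in one table means the field count check and the type conversion stay in step if the layout is ever adjusted.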
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Baharna Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 661-985-397-395-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English CTS Treebank with Structural Metadata
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English CTS Treebank with Structural Metadata, Linguistic Data Consortium (LDC) catalog
number LDC2009T01 and ISBN 1-58563-476-X, consists of metadata and syntactic structure
annotations for 144 English telephone conversations, or 140,000 words, from data used
in the EARS (Effective, Affordable, Reusable Speech-to-Text) program. English CTS Treebank
with Structural Metadata was created to support EARS work in English. It applies EARS
metadata extraction annotations and Penn Treebank methods to conversations from Switchboard-1
Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol (released
in EARS as LDC2004E16, LDC2004E29 and LDC2005E73). The purpose of the EARS program
was to develop robust speech recognition technology to address a range of languages
and speaking styles. LDC provided conversational and broadcast speech and transcripts,
annotations, lexicons and texts for language modeling in each of the EARS languages
(Arabic, Chinese, English). LDC also supported a metadata extraction (MDE) research
evaluation, the goal of which was to enable technology to take raw speech-to-text
(STT) output and to refine it into forms of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic transcripts that
are maximally readable. This readability might be achieved in a number of ways: removing
non-content words like filled pauses and discourse markers from the text; removing
sections of disfluent speech; and creating boundaries between natural breakpoints
in the flow of speech so that each sentence or other meaningful unit of speech might
be presented on a separate line within the resulting transcript. Natural capitalization,
punctuation and standardized spelling, plus sensible conventions for representing
speaker turns and identity are further elements in the readable transcript. Some of
the data developed by LDC for the MDE task is contained in the LDC Catalog, i.e.,
RT-04 MDE Training Data Speech, LDC2005S16 and RT-04 MDE Training Data Text/Annotations,
LDC2005T24. *Data* Speech The telephone speech used in English CTS Treebank with Structural
Metadata was drawn from Switchboard-1 Release 2 (LDC97S62) and from data collected
for EARS under the Fisher Protocol (released in EARS as LDC2004E16, LDC2004E29 and
LDC2005E73). The speech for all files was recorded on two channels with a sampling
rate of 8000 Hz and was encoded in ulaw format. The Fisher data was transcribed by
LDC staff; for the Switchboard data, transcripts developed at the Institute for Signal
and Information Processing at Mississippi State University were used. Structural Metadata
Annotation The transcribed data was annotated to SimpleMDE V6.2, an annotation task
defined by LDC that consisted of the following elements: Edit Disfluencies (repetitions,
revisions, restarts and complex disfluencies), Fillers (including, e.g., filled pauses
and discourse markers) and SUs, or syntactic/semantic units. Each of these elements
is described below: * Edit Disfluencies: Edit disfluencies, or speech repairs, occur
when speakers correct or alter their utterances or abandon them entirely and start
over. Edit disfluencies have a more complex internal structure than fillers, consisting
of the original utterance (reparandum), an interruption point, an optional editing
phase and a correction. There are four types of disfluencies annotated in SimpleMDE:
repetitions; revisions; restarts; and complex disfluencies, which consist of multiple
or nested edits. In SimpleMDE, annotators labeled only the deletable region (DELREG)
of the disfluency which corresponded to the reparandum. In cases where the reparandum
contained multiple disfluent utterances, annotators identified the maximal extent
of the disfluent portion, starting with the left edge of the first disfluency and
continuing to the right edge (IP) of the final disfluency. * Fillers: While the term
filler has traditionally been synonymous with filled pause, SimpleMDE uses the term
to encompass a broad set of vocalized space-fillers: filled pauses (FPs), discourse
markers (DMs), explicit editing terms (EETs) and asides/parentheticals (A/Ps). Excepting
the last category, fillers can be understood as words that do not alter the propositional
content of the material into which they are inserted. For example, FPs include nonlexemes,
such as um or ah, that speakers use to indicate hesitation or to maintain control
of a conversation. A DM is a word or phrase that functions primarily as a structuring
unit of spoken language, such as actually, now, anyway, see, basically, so, I mean,
well, let's see, you know, like, you see. DMs often signal the speaker's intention
to mark a boundary in discourse, like a change in speaker or the beginning of a new
topic. There is no exhaustive list of DMs for a given language due to their wide range
of functions, colloquial variations, and the difficulty of defining them precisely.
Furthermore, words that are used as discourse markers can be used for other purposes.
EETs occur during an edit disfluency and consist of an overt statement (e.g., I'm
sorry) from the speaker recognizing the disfluency. Asides and parentheticals (A/Ps)
are different from the other filler types in that they convey semantic information
in the form of a short side comment before returning to the main topic. This may be
either on a new topic (asides) or on the same topic of the larger utterance (parentheticals).
Both break up the stream of discourse and are often accompanied by noticeable prosodic
features. * Syntactic Units: One of the goals of MDE annotation is the identification
of all units within the discourse that function to express a complete thought or idea
on the part of the speaker. Within MDE these elements are called SUs (Syntactic, Semantic
or Slash Units). As with disfluency annotation, the goal of SU labeling is to improve
transcript readability by presenting information in small, structured, coherent chunks.
There are four sentence-level SUs. Statements are complete SUs that function as a
declarative statement and are marked with /.; questions are complete SUs that function
as an interrogative and are marked with /?. Backchannels are an open class of words
uttered by the non-dominant speaker to indicate engagement in the conversation and
are marked with /@. Incomplete SUs occur when an utterance does not constitute a grammatically
complete sentence, phrase or continuer, and does not express a complete thought; these
are marked with /-. To enhance inter-annotator consistency, there are also sentence-internal
clausal and coordinating SUs (/, and /&). Parsing and Treebank Annotation The existing
MDE annotations were converted from RTTM format into a format appropriate for the
automatic parser, enabling the generation of accurate parses in a form that would
require as little hand modification by the Treebank team as possible. RTTM is a format
developed by NIST (National Institute of Standards and Technology) for the EARS program
that labeled each token in the reference transcript according to the properties it
displays (e.g., lexeme versus non-lexeme, edit, filler, SU). The initial parse trees
were produced using an entropy-based parser, which was trained on Switchboard transcripts
supplemented with Wall Street Journal data (with a 4:1 ratio). These parses served
as the starting point for a manual process which corrected the initial pass for each
conversation. To provide high quality parses, scripts were used to separate the edited
material from the fluent part of each SU prior to parsing it using the MDE annotations.
The edits were then parsed and reinserted into the tree for presentation to the annotators.
Some important issues are listed below: * Words were tokenized in Syntactic Units
using LDC's scripts. * All of the punctuation provided in the markup was maintained
in the SU for parsing because it was likely to enhance parse accuracy and was expected
to appear in the final tree annotations. * For parsing complex edits, contiguous edits
were concatenated into one unit for parsing. In a few cases, edits occur across SUs
in MDE annotations. * Special treatment was required in the scripts for regions unannotated
for MDE, complex edits, and SUs that consisted solely of edited material. * The
string "EDITED" was used as the non-terminal tag for edit regions inserted into the fluent
parse trees. Additionally, a terminal node for the IP, (DISFL-IP +), was added at the
end of the edits in an attempt to make the tree follow the conventions used in the
Switchboard Treebank. Manual treebank annotation was performed in accordance with
existing treebank guidelines for conversational telephone speech as well as in accordance
with revised general guidelines for treebanking.
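The SU markers listed in this summary can be illustrated with a small Python sketch. It assumes, purely for illustration, that each marker appears as a separate whitespace-delimited token after the words it closes, and that /, and /& denote clausal and coordinating boundaries in the order the summary names them. The function name split_sus is hypothetical; the released annotations are in RTTM and treebank formats, which this sketch does not read.

```python
# Sentence-level SU markers named in the summary above.
SENTENCE_LEVEL = {
    "/.": "statement",
    "/?": "question",
    "/@": "backchannel",
    "/-": "incomplete",
}
# Sentence-internal markers (assumed order: clausal, then coordinating).
SENTENCE_INTERNAL = {"/,": "clausal", "/&": "coordinating"}


def split_sus(text):
    """Return (words, su_type) pairs, one per sentence-level SU."""
    units = []
    current = []
    for tok in text.split():
        if tok in SENTENCE_LEVEL:
            units.append((" ".join(current), SENTENCE_LEVEL[tok]))
            current = []
        elif tok in SENTENCE_INTERNAL:
            # Internal boundary: it does not close the sentence-level SU.
            continue
        else:
            current.append(tok)
    if current:
        # Trailing words with no closing marker are left untyped.
        units.append((" ".join(current), None))
    return units
```

Printing one returned unit per line gives the kind of layout the summary describes, with each sentence or other meaningful unit on its own line.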
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634999
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 306-125-545-782-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1, Linguistic Data
Consortium (LDC) catalog number LDC2009T02 and ISBN 1-58563-499-9, contains transcripts
and English translations of 20.4 hours of Chinese broadcast conversation programming
from China Central TV (CCTV) and Phoenix TV. It does not contain the audio files from
which the transcripts and translations were generated. GALE Phase 1 Chinese Broadcast
Conversation Parallel Text - Part 1, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. *Source Data:* A total
of 20.4 hours of Chinese broadcast conversation programming were selected from two
sources: CCTV (a broadcaster from Mainland China), and Phoenix TV (a Hong Kong-based
satellite TV station). The transcripts and translations represent recordings of eight
different programs. A manual selection procedure was used to choose data appropriate
for the GALE program, namely conversation (talk) programs focusing on current events.
Stories on topics such as sports, entertainment and business were excluded from the
data set. The following table is a summary of the files included in this release.
Source      Program               Epoch (YYYY.MM)    #hours  #characters
CCTV        Across China          2005.08            1.0     9,924
CCTV        Todays Focus          2005.11            2.2     33,805
Phoenix TV  Asian Journal         2005.09            2.2     26,656
Phoenix TV  Behind the Headlines  2005.03 - 2005.11  1.5     17,933
Phoenix TV  A Date With Lu Yu     2005.09 - 2005.10  7.1     89,987
Phoenix TV  News Hacker           2005.03 - 2005.10  2.3     39,388
Phoenix TV  Newsline              2005.10 - 2005.11  1.6     15,496
Phoenix TV  Social Watch          2005.09 - 2005.11  2.5     29,159
*Transcription:* The selected audio snippets
were carefully transcribed by LDC annotators and professional transcription agencies
following LDC's Quick Rich Transcription specification. Manual sentence units/segments
(SU) annotation was also performed as part of the transcription task. Three types
of end of sentence SU are identified: * statement SU * question SU * incomplete SU
*Translation:* After transcription and SU annotation, files were reformatted into
a human-readable translation format and assigned to professional translators for careful
translation. Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as names and speech
disfluencies) and quality control procedures applied to completed translations. *TDF
Format:* All final data are in Tab Delimited Format (TDF). TDF is compatible with
other transcription formats, such as the Transcriber format and AG format, and it
is easy to process. Each line of a TDF file corresponds to a speech segment and contains
13 tab-delimited fields (name, data type): file (unicode); channel (int); start (float);
end (float); speaker (unicode); speakerType (unicode); speakerDialect (unicode); transcript
(unicode); section (int); turn (int); segment (int); sectionType (unicode); suType (unicode).
A source TDF file
and its translation are the same except that the transcript in the source TDF is replaced
by its English translation. *Sponsorship* This work was supported in part by the Defense
Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.
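Because the summary states that a source TDF file and its translation differ only in the transcript field, the two can be aligned segment by segment. The following Python sketch shows one way to do that. The function name align_tdf and the strict equality check on the other twelve fields are assumptions; if the released files differ in any other field, the check would need to be relaxed.

```python
TRANSCRIPT_INDEX = 7  # transcript is the 8th of the 13 tab-delimited fields
FIELD_COUNT = 13


def align_tdf(source_lines, translation_lines):
    """Pair each source transcript with its English translation.

    Both arguments are lists of TDF lines in the same segment order.
    """
    if len(source_lines) != len(translation_lines):
        raise ValueError("source and translation have different segment counts")
    pairs = []
    for n, (src, tgt) in enumerate(zip(source_lines, translation_lines), start=1):
        s = src.rstrip("\r\n").split("\t")
        t = tgt.rstrip("\r\n").split("\t")
        if len(s) != FIELD_COUNT or len(t) != FIELD_COUNT:
            raise ValueError(f"line {n}: expected {FIELD_COUNT} fields")
        # Compare every field except the transcript.
        s_meta = s[:TRANSCRIPT_INDEX] + s[TRANSCRIPT_INDEX + 1:]
        t_meta = t[:TRANSCRIPT_INDEX] + t[TRANSCRIPT_INDEX + 1:]
        if s_meta != t_meta:
            raise ValueError(f"line {n}: non-transcript fields differ")
        pairs.append((s[TRANSCRIPT_INDEX], t[TRANSCRIPT_INDEX]))
    return pairs
```

Failing loudly on a mismatch, rather than skipping the line, keeps a misaligned pair from silently entering parallel training data.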
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585634964
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 121-605-639-540-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Audiovisual Database of Spoken American English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC)
catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University,
Indianapolis, IN in 2007 for use by a variety of researchers to evaluate speech
production and speech recognition. It contains approximately seven hours of audiovisual
recordings of fourteen American English speakers producing syllables, word lists and
sentences used in both academic and clinical settings. All talkers were from the North
Midland dialect region -- roughly defined as Indianapolis and north within the state
of Indiana -- and had lived in that region for the majority of the time from birth
to 18 years of age. Each participant read 238 different words and 166 different sentences.
The sentences spoken were drawn from the following sources: * Central Institute for
the Deaf (CID) Everyday Sentences (Lists A-J) * Northwestern University Auditory Test
No. 6 (Lists I-IV) * Vowels in /hVd/ context (separate words) * Texas Instruments/Massachusetts
Institute of Technology (TIMIT) sentences The CID Everyday Sentences were created
in the 1950s from a sample developed by the Armed Forces National Research Committee
on Hearing and Bio-Acoustics. They are considered to represent everyday American speech
and have the following characteristics: the vocabulary is appropriate to adults; the
words appear with high frequency in one or more of the well-known word counts of the
English language; proper names and proper nouns are not used; common non-slang idioms
and contractions are used freely; phonetic loading and "tongue-twisting" are avoided;
redundancy is high; the level of abstraction is low; and grammatical structure varies
freely. Northwestern University Auditory Test No. 6 is a phonemically-balanced set
of monosyllabic English words used clinically to test speech perception in adults
with hearing loss. The /hVd/ vowel list was created to elicit all of the vowel sounds
of American English. The TIMIT sentences are a subset (34 sentences) of the 2342 phonetically-rich
sentences read by speakers in the TIMIT Acoustic-Phonetic Continuous Speech Corpus
LDC93S1. TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic
knowledge and for the development and evaluation of automatic speech recognition systems.
TIMIT speakers were from eight dialect regions of the United States. The Audiovisual
Database of Spoken American English will be of interest in various disciplines: to
linguists for studies of phonetics, phonology, and prosody of American English; to
speech scientists for investigations of motor speech production and auditory-visual
speech perception; to engineers and computer scientists for investigations of machine
audio-visual speech recognition (AVSR); and to speech and hearing scientists for clinical
purposes, such as the examination and improvement of speech perception by listeners
with hearing loss. *Data* Participants were recorded individually during a single
session. A participant first completed a statement of informed consent and a questionnaire
to gather biographical data and then was asked by the experimenter to mark his or
her Indiana hometown on a state map. The experimenter and participant then moved to
a small, sound-treated studio where the participant was seated in front of three navy
blue baffles. A laptop computer was elevated to eye-level on a speaker stand and placed
approximately 50-60 cm in front of the participant. Prompts were presented to the
participant in a Microsoft PowerPoint presentation. The experimenter was seated directly
next to the participant, but outside the camera angle, and advanced the PowerPoint
slides at a comfortable pace. Participants were recorded with a Panasonic DVC-80 digital
video camera to miniDV digital video cassette tapes. All participants wore a Sennheiser
MKE-2060 directional/cardioid lapel microphone throughout the recordings. Each speaker
produced a total of 94 segmented files which were converted from Final Cut Express
to Quicktime (.mov) files and then saved in the appropriately marked folder. If a
speaker mispronounced a sentence or word during the recording process, the mispronunciations
were edited out of the segments to be archived. The remaining parts of the recording,
including the correct repetition of each prompt, were then sequenced together to create
a continuous and complete segment. The fourteen participants were between 19 and 61
years of age (with a mean age of 30 years) and native speakers of American English.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Richie, Carolyn
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warburton, Sarah
ADDED ENTRY--PERSONAL NAME
- Personal name:
Carter, Megan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635065
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 146-868-323-212-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1, Linguistic Data Consortium (LDC)
catalog number LDC2009T03 and ISBN 1-58563-506-5, was prepared by LDC and contains
a total of 178,000 words (264 files) of Arabic newsgroup text and its translation
selected from thirty-five sources. Newsgroups consist of posts to electronic bulletin
boards, Usenet newsgroups, discussion groups and similar forums. This release was
used as training data in Phase 1 (year 1) of the DARPA-funded GALE program. LDC has
released the following GALE Phase 1 & 2 Arabic Parallel Text data sets: * GALE Phase
1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24) * GALE Phase 1 Arabic
Broadcast News Parallel Text - Part 2 (LDC2008T09) * GALE Phase 1 Arabic Blog Parallel
Text (LDC2008T02) * GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09) * GALE Phase 2
Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06) * GALE Phase 2 Arabic
Broadcast Conversation Parallel Text Part 2 (LDC2012T14) * GALE Phase 2 Arabic Newswire
Parallel Text (LDC2012T17) * GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01) *Source Data* Preparing the source
data involved four stages of work: data scouting, data harvesting, formatting and
data selection. Data scouting involved manually searching the web for suitable newsgroup
text. Data scouts were assigned particular topics and genres along with a production
target in order to focus their web search. Formal annotation guidelines and a customized
annotation toolkit helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest (sites, threads
and posts) to a database. A nightly process queried the annotation database and harvested
all designated URLs. Whenever possible, the entire site was downloaded, not just the
individual thread or post located by the data scout. Once the text was downloaded,
its format was standardized (by running various scripts) so that the data could be
more easily integrated into downstream annotation processes. Original-format versions
of each document were also preserved. Typically, a new script was required for each
new domain name that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems. The selected documents were then reviewed
for content-suitability using a semi-automatic process. A statistical approach was
used to rank a document's relevance to a set of already-selected documents labeled
as "good." An annotator then reviewed the list of relevance-ranked documents and selected
those which were suitable for a particular annotation task or for annotation in general.
These newly-judged documents in turn provided additional input for the generation
of new ranked lists. Manual sentence unit/segment (SU) annotation was also performed
on a subset of files following LDC's Quick Rich Transcription specification. Three
types of end of sentence SU were identified: * statement SU * question SU * incomplete
SU *Translation* After files were selected, the files were reformatted into a human-readable
translation format, and the files were then assigned to professional translators for
careful translation. Translators followed LDC's GALE translation guidelines, which
describe the makeup of the translation team, the source data format, the translation
data format, best practices for translating certain linguistic features (such as names
and speech disfluencies) and quality control procedures applied to completed translations.
*TDF Format* All final data are presented in Tab Delimited Format (TDF). TDF is compatible
with other transcription formats, such as the Transcriber format and AG format, making
it easy to process. Each line of a TDF file corresponds to a speech segment and contains
13 tab-delimited fields:
No.  Field           Data type
1    file            unicode
2    channel         int
3    start           float
4    end             float
5    speaker         unicode
6    speakerType     unicode
7    speakerDialect  unicode
8    transcript      unicode
9    section         int
10   turn            int
11   segment         int
12   sectionType     unicode
13   suType          unicode
A source TDF file and its translation are the same except that the transcript
in the source TDF is replaced by its English translation. Some fields are inapplicable
to newsgroup text: the channel, start time, end time and speaker dialect
fields. These fields are either empty or contain placeholder values. *Encoding*
All data are encoded in UTF-8. *Sponsorship* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
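The 13-field TDF layout described above can be read with a few lines of Python. The following is a minimal sketch, not LDC's own tooling: the function name `parse_tdf_line` and the sample values are hypothetical, and it assumes that inapplicable numeric fields are either empty or hold a parseable placeholder number. Any header or comment lines that a real TDF file may contain are not handled here.

```python
from typing import Dict, List, Tuple, Union

# Field names and types, in the order given in the TDF description above.
TDF_FIELDS: List[Tuple[str, type]] = [
    ("file", str), ("channel", int), ("start", float), ("end", float),
    ("speaker", str), ("speakerType", str), ("speakerDialect", str),
    ("transcript", str), ("section", int), ("turn", int), ("segment", int),
    ("sectionType", str), ("suType", str),
]

Value = Union[str, int, float, None]


def parse_tdf_line(line: str) -> Dict[str, Value]:
    """Split one tab-delimited line into a dict keyed by field name.

    Assumption: an empty numeric field means "not applicable" and maps to None.
    """
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(parts)}")
    record: Dict[str, Value] = {}
    for (name, cast), raw in zip(TDF_FIELDS, parts):
        if cast is str:
            record[name] = raw
        else:
            record[name] = cast(raw) if raw.strip() else None
    return record


if __name__ == "__main__":
    # Made-up newsgroup-style line: channel, start, end and dialect left empty.
    sample = "\t".join([
        "example_post", "", "", "", "poster1", "unknown", "",
        "example text", "0", "1", "2", "post", "statement",
    ])
    print(parse_tdf_line(sample))
```

Mapping empty numeric fields to `None` keeps the "inapplicable" case distinct from a genuine zero; a reader who prefers placeholder numbers can change that one branch.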
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635049
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 969-572-383-651-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BioProp Version 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BioProp Version 1.0 was developed by researchers at Academia Sinica, Taipei, Taiwan.
It consists of proposition bank-style annotations for approximately 500 English biomedical
journal abstracts. The source abstracts, annotated in accordance with Penn Treebank
II guidelines, are contained in the GENIA Treebank (GTB). The GTB was developed at
the Tsujii Laboratory at the University of Tokyo. The purpose of the GENIA Project
is to develop tools and resources for automatic information extraction of biomedical
information. One result of that work is the GENIA corpus, a collection of 2000 biomedical
journal abstracts containing semantic class annotation for biomedical terms, part-of-speech
(POS) tags and coreferences. The GTB is a subset of that corpus. BioProp Version
1.0 adds a proposition bank to the GTB. Proposition Bank (PropBank) contains annotations
of predicate argument structures and semantic roles in a treebank schema in the newswire
domain. To construct BioProp Version 1.0, a semantic role labeling (SRL) system trained
on PropBank was used to annotate the GTB. SRL, also called shallow semantic parsing,
is a popular semantic analysis technique. In SRL, sentences are represented by one
or more predicate-argument structures (PAS), also known as propositions. Each PAS
is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases)
that have different semantic roles, including main arguments such as agent and patient,
and adjunct arguments, such as time, manner and location. The term "argument" refers
to a syntactic constituent of the sentence related to the predicate, and the term
"semantic role" refers to the semantic relationship between a sentence's predicate
and argument. To suit the needs in the biomedical domain, the PropBank annotation
guidelines were modified to characterize semantic roles as components of biological
events. Specifically, thirty verbs were selected according to their frequency of use
or importance in biomedical texts. Since targets in information extraction are relations
of named entities, only sentences containing protein or gene names were used to count
each verb's frequency. Verbs of general usage were filtered out in order to keep the
focus on biomedical verbs. Some verbs that do not have a high frequency but play important
roles in describing biomedical relations, such as "phosphorylate" and "transactivate,"
were also selected. The BioProp annotation was based on Levin's verb classes as defined
in the VerbNet lexicon. In VerbNet, the arguments of each verb are represented at
the semantic level, and thus have associated semantic roles. However, since some verbs
may have different usages in biomedical and newswire texts, it is necessary to customize
the framesets of biomedical verbs. After selecting the predicate verbs, a semi-automatic
method was used to annotate BioProp. The annotation process consisted of the following
steps: * Identification of predicate candidates * Automatic annotation of the biomedical
semantic roles using a newswire SRL system * Transformation of automatic tagging results
into WordFreak format * Review by human annotators *Data* BioProp Version 1.0 consists
of approximately 150,000 words. Each line in the corpus provides a PAS annotation
that can be mapped to a sentence in the GTB.
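The predicate-argument structure described above can be pictured as a small data type. This is an illustrative sketch only: the class names `Argument` and `PAS` are hypothetical, the example sentence fragment is invented, and the actual line format of the BioProp files is not specified in this record. The role labels follow the general PropBank naming pattern (`ARG0`, `ARG1`, `ARGM-LOC`).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    role: str  # semantic role label, e.g. "ARG0" or "ARGM-LOC"
    text: str  # the syntactic constituent filling that role


@dataclass
class PAS:
    """One predicate-argument structure (proposition)."""
    predicate: str
    arguments: List[Argument] = field(default_factory=list)

    def roles(self) -> Dict[str, str]:
        """Map each role label to the text of its argument."""
        return {arg.role: arg.text for arg in self.arguments}

    def adjuncts(self) -> List[Argument]:
        """Return adjunct arguments (time, manner, location and so on)."""
        return [arg for arg in self.arguments if arg.role.startswith("ARGM")]


if __name__ == "__main__":
    # Invented example: "IL-2 activates NF-kappa B in T cells"
    pas = PAS("activate", [
        Argument("ARG0", "IL-2"),            # agent
        Argument("ARG1", "NF-kappa B"),      # patient
        Argument("ARGM-LOC", "in T cells"),  # location adjunct
    ])
    print(pas.roles())
```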
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Medical sciences
- Form subdivision:
Periodicals
- General subdivision:
Abstracting and indexing
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic indexing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hsu, Wen-Lian
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635081
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 415-470-503-471-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST MetricsMATR is a series of research challenge events for machine translation
(MT) metrology, promoting the development of innovative, even revolutionary, MT metrics
that correlate highly with human assessments of MT quality. In this program, participants
submit their metrics to the National Institute of Standards and Technology (NIST).
NIST runs those metrics on certain held-back test data for which it has human assessments
measuring quality and then calculates correlations between the automatic metric scores
and the human assessments. This release contains the development data received by
participants in NIST Metrics for Machine Translation 2008 Evaluation (MetricsMATR08).
Specifically, this corpus is comprised of a subset of the materials used in the NIST
Open MT06 evaluation and includes human reference translations, system translations,
and human assessments of adequacy and preference. The source data consists of twenty-five
Arabic language newswire documents with a total of 249 segments. The data in each
segment includes four human reference translations in English and system translations
from eight different MT06 machine translation systems. In addition to the data and
reference translations, this release includes software tools for evaluation and reporting
and documentation describing how the human assessments were obtained and how they
are represented in the data. The evaluation plan contains further information and
rules on the use of this data. The MetricsMATR program seeks to overcome several drawbacks
to the methods employed for the evaluation of MT technology. Currently, automatic
metrics have not yet proved able to predict the usefulness and reliability of MT technologies
with confidence. Nor have automatic metrics demonstrated that they are meaningful
in target languages other than English. Human assessments, however, are expensive,
slow, subjective and difficult to standardize. These problems, and the need to overcome
them through the development of improved automatic (or even semi-automatic) metrics,
have been a constant point of discussion at past NIST MT evaluation events. MetricsMATR
aims to provide a platform to address these shortcomings. Specifically, the goals
of MetricsMATR are: * To inform other MT technology evaluation campaigns and conferences
with regard to improved metrology. * To establish an infrastructure that encourages
the development of innovative metrics. * To build a diverse community that will bring
new perspectives to MT metrology research. * To provide a forum for MT metrology discussion
and for establishing future directions of MT metrology. *Data* The MetricsMATR08 development
data set released here is reflective of the test data set only to a degree; the evaluation
data set contains more varied data -- from more genres, more source languages, more
systems and different evaluations -- than this development data set. There are also
more types of human assessments for the test data. The MetricsMATR08 test data remains
unseen to allow for repeated use as test data. The software used for obtaining the
human judgments included in this data set is the same software used for the NIST Open
MT08 human assessments. It includes a description of the adequacy and preference assessment
tasks and the instructions given to the judges. All segments assessed were judged
by two independent judges. Adequacy judgments were performed for all segments of each
document. Preference judgments were performed for the first four segments of each
document such that full pair-wise comparisons between all eight MT systems were obtained.
All judgments were performed against only one reference translation. The score represents
an adjudicated score over the two individual judgments. The official results of MetricsMATR08
on the test data for the metrics submitted to MetricsMATR08 are publicly available.
NIST performed the same analyses on the MetricsMATR08 development data after the evaluation.
These results are not publicly available, but will likely be available on request
in the future by contacting mt_poc@nist.gov.
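The core evaluation step described above, correlating automatic metric scores with human assessments, can be sketched as follows. This record does not say which correlation statistic NIST used, so the Pearson coefficient here is an assumption chosen for illustration, and the function name `pearson` and the input numbers are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between metric scores and human assessments."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-segment scores: automatic metric vs. adjudicated adequacy.
    metric_scores = [0.21, 0.35, 0.40, 0.62]
    human_scores = [2.0, 3.0, 3.5, 5.0]
    print(pearson(metric_scores, human_scores))
```

A rank-based statistic could be substituted without changing the calling code.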
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Przybocki, Mark
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peterson, Kay
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bronsart, Sébastien
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u jpn d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635103
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 380-138-081-238-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
jpn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Japanese Web N-gram Version 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Japanese Web N-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2009T08
and ISBN 1-58563-510-3, was created by Google Inc. It consists of Japanese "word"
n-grams and their observed frequency counts generated from over 255 billion tokens
of text. The length of the n-grams ranges from unigrams to seven-grams. The n-grams
were extracted from publicly accessible web pages that were crawled by Google in July
2007. This data set contains only n-grams that appear at least 20 times in the processed
sentences. Less frequent n-grams were simply discarded. Those web pages requiring
user authentication, pages containing "noarchive" or "noindex" meta tags, and pages
under other special restrictions were excluded from the final release. While the aim
was to process only Japanese pages, the corpus may contain some pages in other languages
due to language detection errors. This dataset will be useful for research in areas
such as statistical machine translation, language modeling and speech recognition,
among others. *Data* Before the n-grams were collected, the web pages were converted
into UTF-8 encoding, normalized into Unicode Normalization Form KC (see below), and
split into sentences. Ill-formed sentences were filtered out, and the remaining sentences
were segmented into "words". All strings were normalized into Unicode Normalization
Form KC (NFKC), which is described in http://www.unicode.org/unicode/reports/tr15/.
Japanese strings were normalized according to the following rules: * Full-width letters/digits
were converted to ASCII letters/digits * Half-width katakana were converted to full-width
katakana * Glyphs for Roman digits were converted to ASCII characters * Certain Japanese-specific
symbols were converted The vocabulary was restricted to "words" that appeared at least
50 times in the processed sentences. Statistical information about the corpus is set
forth below. The total compressed data size is about 26GB.
Number of tokens: 255,198,240,937
Number of sentences: 20,036,793,177
Number of unique unigrams: 2,565,424
Number of unique bigrams: 80,513,289
Number of unique trigrams: 394,482,216
Number of unique 4-grams: 707,787,333
Number of unique 5-grams: 776,378,943
Number of unique 6-grams: 688,782,933
Number of unique 7-grams: 570,204,252
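The normalization and frequency cut-off described above can be sketched with Python's standard library. This is a minimal illustration, not Google's pipeline: the function names `normalize` and `count_ngrams` are hypothetical, sentences are assumed to be already segmented into "words", and the conversion of "certain Japanese-specific symbols" mentioned above is not reproduced because this record does not list those symbols.

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def normalize(text: str) -> str:
    """Apply Unicode Normalization Form KC (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def count_ngrams(
    sentences: Iterable[Sequence[str]],
    max_n: int = 7,
    min_count: int = 20,
) -> Dict[Tuple[str, ...], int]:
    """Count 1..max_n-grams and keep those seen at least min_count times."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    print(normalize("ＡＢＣ１２３"))  # full-width letters/digits
    print(normalize("ｶﾀｶﾅ"))          # half-width katakana
    print(normalize("Ⅲ"))             # Roman numeral glyph
```

The defaults of 7 and 20 mirror the maximum n-gram length and the minimum n-gram frequency stated above; the separate vocabulary threshold of 50 is not modeled.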
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Japanese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Japanese language
- Form subdivision:
Databases.
- General subdivision:
Word frequency
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Web sites
- Geographic subdivision:
Japan.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kudo, Taku
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kazawa, Hideto
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635030
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 150-170-938-041-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2, Linguistic Data
Consortium (LDC) catalog number LDC2009T06 and ISBN 1-58563-503-0, contains transcripts
and English translations of 24 hours of Chinese broadcast conversation programming
from China Central TV (CCTV), Phoenix TV and Voice of America (VOA). It does not contain
the audio files from which the transcripts and translations were generated. This release,
along with other corpora, was used as training data in Phase 1 (year 1) of the DARPA-funded
GALE program. GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 was
released in January 2009. *Source Data* A total of 24 hours of Chinese broadcast conversation
programming was selected from three sources: CCTV (a broadcaster from Mainland China),
Phoenix TV (a Hong Kong-based satellite TV station) and VOA. The transcripts and translations
represent recordings of seven different programs. A manual selection procedure was
used to choose data appropriate for the GALE program, namely, conversation (talk)
programs focusing on current events. Stories on topics such as sports, entertainment
and business were excluded from the data set. The following table is a summary of
the files included in this release.
Source      Program               Epoch (YYYY.MM)    #hours  #characters
CCTV        Today's Focus         2006.01            2.1     36,398
Phoenix TV  Asian Journal         2005.09 - 2005.11  2.5     27,876
Phoenix TV  Behind the Headlines  2006.01            1.5     7,170
Phoenix TV  A Date with Lu Yu     2005.10 - 2006.01  10.9    129,782
Phoenix TV  News Hacker           2005.10 - 2005.11  3.0     43,070
Phoenix TV  Passion on China      2005.10            2.0     21,482
VOA         News08                2005.03            2.0     28,621
*Transcription* The selected audio snippets were
carefully transcribed by LDC annotators and professional transcription agencies following
LDC's Quick Rich Transcription specification. Manual sentence unit/segment (SU)
annotation was also performed as part of the transcription task. Three types of end
of sentence SU are identified: * statement SU * question SU * incomplete SU *Translation*
After transcription and SU annotation, files were reformatted into a human-readable
translation format and assigned to professional translators for careful translation.
Translators followed LDC's GALE Translation guidelines, which describe the makeup of
the translation team, the source data format, the translation data format, best practices
for translating certain linguistic features (such as names and speech disfluencies)
and quality control procedures applied to completed translations. *TDF Format* All final
data are in Tab Delimited Format (TDF). TDF is compatible with other transcription
formats, such as the Transcriber format and AG format, and it is easy to process.
Each line of a TDF file corresponds to a speech segment and contains 13 tab-delimited
fields:
No.  Field           Data type
1    file            unicode
2    channel         int
3    start           float
4    end             float
5    speaker         unicode
6    speakerType     unicode
7    speakerDialect  unicode
8    transcript      unicode
9    section         int
10   turn            int
11   segment         int
12   sectionType     unicode
13   suType          unicode
A source TDF
file and its translation are the same except that the transcript in the source TDF
is replaced by its English translation. *Encoding* All data are encoded in UTF-8. *Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not
necessarily reflect the position or the policy of the Government, and no official
endorsement should be inferred.
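Because a source TDF file and its translation differ only in the transcript field, parallel segment pairs can be recovered by reading the two files line by line. The sketch below assumes the files are aligned one line per segment and that the other twelve fields match exactly; the function name `pair_segments` is hypothetical, and if the `file` field differs between source and translation in practice, the metadata check would need to be relaxed.

```python
from typing import Iterable, List, Tuple

FIELD_COUNT = 13
TRANSCRIPT = 7  # zero-based index of the 8th field, "transcript"


def _split(line: str) -> List[str]:
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    return parts


def pair_segments(
    source_lines: Iterable[str],
    translation_lines: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (source transcript, English translation) pairs."""
    src_rows = [_split(line) for line in source_lines]
    tgt_rows = [_split(line) for line in translation_lines]
    if len(src_rows) != len(tgt_rows):
        raise ValueError("source and translation have different segment counts")
    pairs = []
    for src, tgt in zip(src_rows, tgt_rows):
        # Every field except the transcript is assumed to be identical.
        if src[:TRANSCRIPT] + src[TRANSCRIPT + 1:] != tgt[:TRANSCRIPT] + tgt[TRANSCRIPT + 1:]:
            raise ValueError("segment metadata does not match")
        pairs.append((src[TRANSCRIPT], tgt[TRANSCRIPT]))
    return pairs
```

Failing loudly on a metadata mismatch is a deliberate choice here: a silent misalignment would corrupt every later pair in a parallel corpus.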
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635111
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 369-443-379-033-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Unified Linguistic Annotation Text Collection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Unified Linguistic Annotation Text Collection, Linguistic Data Consortium (LDC)
catalog number LDC2009T07 and ISBN 1-58563-511-1, consists of two separate corpora:
The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX Entity Translation
Training/DevTest (LDC2009T11). Most recent annotation efforts for language have focused
on small pieces of the larger problem of semantic annotation rather than producing
a single unified representation. The Unified Linguistic Annotation (ULA) project,
sponsored by the National Science Foundation, seeks to integrate into one framework
different layers of annotation (e.g., semantics, discourse, temporal, opinions) using
various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse
Treebank and coreference and opinion annotations. The project represents a concerted
effort of researchers from several institutions to develop a large word corpus with
balanced and annotated data. The ULA Text Collection is provided as a resource for
the ULA effort. It consists of two datasets, the Language Understanding Annotation
Corpus from the Johns Hopkins Center of Excellence in Human Language Technology and
ACE Reflex Entity Translation Training Dev/Test developed by LDC. The Language Understanding
Annotation Corpus (LDC2009T10). The Language Understanding Annotation Corpus consists
of over 9000 words of English text (6949 words) and Arabic text (2183 words) annotated
for committed belief, event and entity coreference, dialog acts and temporal relations.
The materials were chosen from various sources to represent "informal input," that
is, text that contains colloquial forms. The documents in the corpus include excerpts
from newswire stories, telephone conversation transcripts, emails, contracts and written
instructions. REFLEX Entity Translation Training/DevTest (LDC2009T11). REFLEX Entity
Translation Training/DevTest is the complete set of training data and development
test data for the 2007 REFLEX Entity Translation evaluation sponsored by the National
Institute of Standards and Technology (NIST). It contains approximately 67.5k words
of newswire and weblog text across English, Chinese and Arabic (approximately 22.5k
words in each language), translated into each of the other two languages. The data is
annotated for entities and TIMEX2 extents and normalization.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635138
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 775-964-514-342-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Language Understanding Annotation Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Language Understanding Annotation Corpus, Linguistic Data Consortium (LDC) catalog
number LDC2009T10 and ISBN 1-58563-513-8, emerged from a series of interdisciplinary
meetings on semantics and pragmatics hosted by the Human Language Technology Center
of Excellence at Johns Hopkins University. The participants were researchers from
BBN Technologies, Carnegie Mellon University and Columbia University who were developing
representations of text semantics, machine translation and summarization systems.
The resulting corpus contains over 9000 words of English text (6949 words) and Arabic
text (2183 words) annotated for committed belief, event and entity coreference, dialog
acts and temporal relations. The source materials were chosen from various genres
to represent "informal input," that is, text that contains colloquial forms. The documents
in the corpus include excerpts from newswire stories, telephone conversation transcripts,
emails, contracts and written instructions. The problem was modeled as an extended
exercise in extracting information elements from a "document" (that is, from discrete
language records in written or spoken forms). The goal was to answer two broad questions:
* What are the elements of knowledge that can be derived from a document? * Can the
representation, and hence, the annotation, be laid out in terms of iterative layers,
the accumulation of which would represent the sum of the knowledge? The annotations
attempted to resolve these questions in the following ways: * Belief/Opinion/Confidence.
Committed belief annotation distinguishes between statements which assert belief or
opinion, those which contain speculation, and statements which convey facts or otherwise
do not convey belief. The goal is to be able to determine automatically from a given
text what beliefs can be ascribed to the author and with what strength the author
holds those beliefs. * Dialog Acts. Dialog act annotation seeks to determine the forward
and backward links between pairs of dialog acts. * Coreference (entities and events).
Event coreferences indicate which events are related to other events at the document
level. Entity relations within these related events provide further information about,
for example, the main actors, targets and causes of the events. * Temporal relations.
Temporal annotations mark the temporal relationships between the different events and
time anchors mentioned in a document; that is, they highlight what the text says about
the timeline of time mentions.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Diab, Mona
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dorr, Bonnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Levin, Lori
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mitamura, Teruko
ADDED ENTRY--PERSONAL NAME
- Personal name:
Passonneau, Rebecca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rambow, Owen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635146
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 364-559-117-639-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
REFLEX Entity Translation Training/DevTest
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
REFLEX Entity Translation Training/DevTest, Linguistic Data Consortium (LDC) catalog
number LDC2009T11 and ISBN 1-58563-514-6, was developed by the LDC for the Automatic
Content Extraction (ACE) program. This release constitutes the complete set of training
data and development test data for the 2007 REFLEX Entity Translation evaluation sponsored
by the National Institute of Standards and Technology (NIST) and consists of approximately
67.5k words of newswire and weblog text across three languages: English, Chinese
and Arabic. The data set is made up of 22.5k words of English data, 22.5k words of
Chinese data, and 22.5k words of Arabic data, each translated into the other two
languages and annotated for entities and TIMEX2 extents and normalization. Entity
Annotation. The annotations identify seven types of entities: Person, Organization,
Location, Facility, Weapon, Vehicle and GeoPolitical Entity. Each type is further
divided into subtypes (for instance, Person subtypes include Individual, Group and
Indefinite). Annotators tagged all mentions of each entity within a document, whether
named, nominal or pronominal. For every mention, the annotator identified the maximal
extent of the string that represents the entity and labeled the head of each mention.
Nested mentions were also captured. Each entity was classified according to its type
and subtype. Each entity mention was further tagged according to its class such as
specific, generic, attributive, negatively quantified or underspecified. Annotators
also reviewed the entire document to group mentions of the same entity together; they
also labeled cases of metonymy, where the name of one entity is used to refer to another
entity (or entities) related to it. TIMEX2 Annotation. TIMEX2 annotation of events
and temporal relations fulfills two objectives. The first is the interpretation of
expressions that refer to time. Such expressions tell when something happened, or
how long something lasted, or how often something occurs. Such expressions also often
require knowledge of the temporal context in order to truly understand them. A second
objective is the normalization of temporal expressions. This facilitates interoperability
between systems. Problems occur, for example, when a programmer in France encodes
"October sixteenth 1962" as "1962.10.16" and one in the U.S. encodes it as "10/16/1962".
It will appear as if two different dates are being referenced. The standards presented
here require that the same meaning is always encoded in the same way.
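The date-encoding mismatch described in this summary can be illustrated with a short sketch. It is not part of the corpus or of the TIMEX2 tooling: the function name `normalize_date`, the list of accepted input formats, and the choice of an ISO 8601 string as the single target encoding are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical list of input encodings: only the two that appear in the
# summary's example ("1962.10.16" and "10/16/1962").
KNOWN_FORMATS = ("%Y.%m.%d", "%m/%d/%Y")


def normalize_date(text: str) -> str:
    """Return one canonical encoding (ISO 8601, assumed here) for a date string."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    raise ValueError(f"unrecognized date encoding: {text!r}")


# Both encodings of October 16, 1962 map to the same normalized value.
french_style = normalize_date("1962.10.16")
us_style = normalize_date("10/16/1962")
```

Once both inputs are reduced to the same string, two systems can compare them directly, which is the interoperability point the summary makes.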
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Medero, Julie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635197
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 014-122-305-405-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Broadcast Conversation Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Czech Broadcast Conversation Speech was prepared by researchers at the University
of West Bohemia, Pilsen, Czech Republic, and consists of 40 hours of speech recorded
from Czech Radio 1 in 2003. Transcripts corresponding to the audio files in this corpus
are provided in Czech Broadcast Conversation MDE Transcripts (LDC2009T20). These corpora
join LDC's other Czech broadcast data sets: Czech Broadcast News Speech (LDC2004S01),
Czech Broadcast News Transcripts (LDC2004T01), Voice of America (VOA) Czech Broadcast
News Audio (LDC2000S89), and Voice of America (VOA) Czech Broadcast News Transcripts
(LDC2000T53). Czech Broadcast Conversation Speech consists of 72 single channel recordings
of Radioforum, a live talk program broadcast by Czech Radio 1 (CRo1) every weekday
evening. Its format consists of invited guests (most often politicians) spontaneously
answering topical questions posed by one or two interviewers. The number of interviewees
in a single program varies from one to three, but typically, one interviewer and two
interviewees appear in the program. The material includes passages of interactive
dialogue, but longer stretches of monologue-like speech comprise the majority of the
collected data. Radioforum also has an interactive segment where listeners call the
studio and ask their own questions. That telephony speech was not transcribed in the
current release. *Data* Individual recordings range from 27 minutes to 36 minutes
each. The recordings were collected during the period from February 12, 2003 through
June 26, 2003. The signal is mono, sampled at 22.05 kHz with 16-bit resolution, stored
in Windows PCM waveform format. The names of the audio files refer to the broadcast
date (rfYYMMDD.wav). Details about the audio files and the transcripts: number of
shows: 72; number of word tokens: 292.6k; number of unique words: 30.5k; duration of
transcribed speech: 33.0h; total number of speakers: 128; male speakers: 108; female
speakers: 20.
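As a rough sketch of how the rfYYMMDD.wav naming convention mentioned in this summary could be used, the snippet below recovers a broadcast date from a file name. The function name `broadcast_date` and the sample file name `rf030212.wav` are hypothetical; the sample is constructed from the stated pattern and the stated start of the collection period, not taken from the corpus listing.

```python
import re
from datetime import date, datetime

# Pattern assumed from the summary: "rf" + two-digit year, month, day + ".wav".
FILENAME_PATTERN = re.compile(r"^rf(\d{6})\.wav$")


def broadcast_date(filename: str) -> date:
    """Extract the broadcast date encoded in an rfYYMMDD.wav file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not an rfYYMMDD.wav file name: {filename!r}")
    # %y maps two-digit years 00-68 to 2000-2068, which covers 2003.
    return datetime.strptime(match.group(1), "%y%m%d").date()


# Hypothetical file name for a show broadcast on February 12, 2003.
example = broadcast_date("rf030212.wav")
```

A date recovered this way could be used to sort recordings chronologically or to check that a file falls within the stated collection period.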
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kolar, Jachym
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635057
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 757-340-046-619-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 CoNLL Shared Task Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008 CoNLL Shared Task Data, Linguistic Data Consortium (LDC) catalog number LDC2009T12
and ISBN 1-58563-505-7, contains the trial corpus, training corpus, development
and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning)
Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations
that include information such as named-entity boundaries, together with semantic dependencies
that model the roles of both verbal and nominal predicates. The materials in the Shared Task
data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun
Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank)
and NomBank v 1.0 LDC2008T23. The Conference on Computational Natural Language Learning
(CoNLL) is accompanied every year by a shared task intended to promote natural language
processing applications and evaluate them in a standard setting. The 2004 and 2005
CoNLL shared tasks were dedicated to semantic role labeling (SRL) in a monolingual
setting (English). In 2006 and 2007, the shared tasks were devoted to the parsing
of syntactic dependencies and used corpora from up to thirteen languages. The 2008
shared task employed a unified dependency-based formalism and merged the task of syntactic
dependency parsing and the task of identifying semantic arguments and labeling them
with semantic roles. LDC has also released the following CoNLL Shared Task data sets:
* 2006 CoNLL Shared Task - Ten Languages (LDC2015T11) * 2006 CoNLL Shared Task - Arabic
& Czech (LDC2015T12) * 2009 CoNLL Shared Task Part 1 (LDC2012T03) * 2009 CoNLL Shared
Task Part 2 (LDC2012T04) * 2015-2016 CoNLL Shared Task (LDC2017T13) *Data* The 2008
shared task was divided into three subtasks: * parsing syntactic dependencies * identification
and disambiguation of semantic predicates * identification of arguments and assignment
of semantic roles for each predicate Several objectives were addressed in this shared
task: * SRL was performed and evaluated using a dependency-based representation for
both syntactic and semantic dependencies. While SRL on top of a dependency treebank
has been addressed before, the approach of the 2008 Shared Task was characterized
by the following novelties: * The constituent-to-dependency conversion strategy transformed
all annotated semantic arguments in PropBank and NomBank v 1.0, not just a subset;
* The annotations addressed propositions centered around both verbal (PropBank) and
nominal (NomBank) predicates. * Based on the observation that a richer set of syntactic
dependencies improves semantic processing, the syntactic dependencies modeled are
more complex than the ones used in the previous CoNLL shared tasks. For example, the
corpus includes apposition links, dependencies derived from named entity (NE) structures,
and better modeling of long-distance grammatical relations. * A practical framework
is provided for the joint learning of syntactic and semantic dependencies. Due to
the complexity of the 2008 shared task, only a single language, English, was used.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Surdeanu, Mihai
ADDED ENTRY--PERSONAL NAME
- Personal name:
Johansson, Richard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marquez, Lluis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nivre, Joakim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u tam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635073
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 140-507-539-817-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
An English Dictionary of the Tamil Verb Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
An English Dictionary of the Tamil Verb represents over twenty-five years of work
led by Harold F. Schiffman, Professor Emeritus of Dravidian Linguistics and Culture
at the University of Pennsylvania's Department of South Asia Studies. It contains
translations for 6597 English verbs and defines 9716 Tamil verbs. This release presents
the dictionary in two formats: Adobe PDF and XML. The PDF format displays the dictionary
in a human readable form. The XML version is a purely electronic form which, while
readable by humans, is intended mainly for application development and the creation
of searchable electronic databases. In the electronic XML version each entry contains
the following: the English entry or head word; the Tamil equivalent (in Tamil script
and transliteration); the verb class and transitivity specification; the spoken Tamil
pronunciation (audio files in mp3 format); the English definition(s); additional Tamil
entries (if applicable); example sentences or phrases in Literary Tamil, Spoken Tamil
(with a corresponding audio file in .mp3 format) and an English translation; and Tamil
synonyms or near-synonyms, where appropriate. Some foods referenced in the example
sentences are illustrated in HTML files that include a detailed description of each
dish. It is expected that the dictionary will be useful for Tamil learners, scholars
and others interested in the Tamil language. *What's New in the Second Edition?* *
Errors in the Tamil text and the roman transliteration have been corrected. * Audio
files have been updated and corrected and missing files have been added. * A brand
new search and browse application that can access the audio has been included in this
edition. This application can be accessed from the tools directory. * The XML structure
has been modified to normalize the presentation of synonyms. *The Tamil Verb* Tamil
is an official language of India, Singapore and Sri Lanka and has roughly 66 million
native speakers worldwide. Most Tamil speakers live in the Tamil Nadu State and northeastern
Sri Lanka, but the extended diaspora includes Malaysia, Mauritius and Singapore. Tamil
is also a Classical Language of India. A member of the Dravidian language family,
it boasts a rich literary tradition stretching back over 2200 years. Tamil is a diglossic
language, meaning that it consists of at least two distinct forms. Spoken Tamil (ST)
refers to the numerous vernacular dialects, and Literary Tamil (LT) refers to the
form of the language used in print and most broadcast news media. The dialects of
Spoken Tamil fall along regional divisions and along caste lines; there is no widely
adopted standard for ST, although one seems to be emerging. Educated Tamil speakers,
on the other hand, generally use LT with little variation in written communication.
It appears, however, that a common dialect of ST may be emerging as a result of a
growing broadcast media and increased rates of higher education. That dialect resembles
the upper caste (non-Brahman) dialects spoken in the urban centers of Tamil Nadu and
borrows verbs from LT. The spoken examples in An English Dictionary of the Tamil Verb
reflect this emerging common dialect. Tamil is also an agglutinative language, meaning
that it constructs verbs by appending inflections in the form of suffixes onto basic
verb-stem morphemes. These inflections primarily denote tense, aspect, voice and mood.
As far as voice is concerned, however, though LT may mark a verb as passive on occasion,
ST rarely makes this distinction. These suffixes mark whether a verb is transitive
or intransitive, that is, they indicate whether the subject is acted on by the verb
or whether the subject is the actor. Mood is implied by verb tense, but may also be
provided by verbal auxiliaries expressing various degrees of probability, futurity,
ability, and their negatives. Tamil also may add suffixes that mark aspectual distinctions,
such as whether an action is considered to be perfective ('complete' and/or 'definite')
or whether it is ongoing or imperfective ('continuous' or 'durative'), as well as
other distinctions. Aspect is a category that is undergoing increasing grammaticalization
and is therefore more usual in ST than in LT. As is common with this process, aspectual
distinctions are 'speaker-centered', i.e. they provide personal observations (some
analysts have referred to this as 'attitude' or 'point of view', which is of course
what the word 'aspect' originally means) which describe the speaker's frame of mind
concerning the event depicted in the sentence -- whether it is perceived to be beneficial
or detrimental, positive or negative, voluntary or involuntary, etc. Aspectual distinctions
vary widely among dialects, both because of the variability of the grammaticalization
process and for historical reasons; Tamil speakers can code-switch among different
dialects depending on context and audience. In addition, ST and LT treat the grammar
of verbs differently. Finding exact equivalents between English and Tamil verbs is
very difficult as a result of Tamil's diglossic nature and because of the difficulty
of mapping English aspectual distinctions onto Tamil aspectual categories. An English
Dictionary of the Tamil Verb seeks to meet needs not currently addressed by existing
English-Tamil dictionaries. The main goal of this dictionary is to get an English-knowing
user to a Tamil verb, irrespective of whether he or she begins with an English verb
or some other item, such as an adjective; this is because what may be a verb in Tamil
may in fact not be a verb in English, and vice versa. Since the number of English
entries is limited (slightly fewer than 10,000), there may not be main entries for certain
low-frequency items like 'pounce', but this item does appear as a synonym for 'jump,
leap', and some other verbs, so searching for 'pounce' will get the user to a Tamil
verb via the synonym field. The main goal is therefore to specifically concentrate
on supplying the kinds of information lacking in all previous attempts to capture
the equivalencies between English and Tamil. This dictionary addresses the following
problem areas: * Verb classes: English-Tamil dictionaries, both current and previously
extant, do not provide the user with any information about the morphological class
of the Tamil verb, nor do they give information as to whether a verb is transitive
or intransitive. This kind of information is readily available in Tamil-English dictionaries,
but not in English-Tamil dictionaries. * Spoken Tamil: No English-Tamil dictionary
gives information about the spoken or colloquial pronunciation of Tamil. Nor do they
indicate whether a verb found in Literary Tamil is also used in Spoken Tamil. Information
about ST is harder to get than Tamil in any other form. No electronic databases exist
for ST, and many speakers of Tamil do not consider ST to be worth devoting any attention
to. For non-Tamil speakers attempting to learn Tamil, however, ST is necessary for
day-to-day functioning in a Tamil environment, and this dictionary is intended to
meet their needs, not primarily the needs of native speakers. * Example Sentences:
Currently extant English-Tamil dictionaries give few if any example sentences that
illustrate the morphological and/or syntactic frames in which verbs occur. This is
particularly important for ST, since it is morphologically and syntactically more
complex than LT, especially in the verb phrase. * Modern Usage: Most extant English-Tamil
dictionaries are now seriously out of date. Their compilers have often simply replicated
the data found in previous dictionaries, with the result that the English represented
in them is that of previous centuries. The Tamil forms given are also lacking in modernity,
but for other reasons. * Syntactic Complexity of the Verb Phrase: Because the Tamil
verb is morphologically complex, and the verb phrase therefore syntactically very
complex, the authors decided to focus only on the Tamil verb. Tamil nouns are, in
contrast, morphologically simple, and the noun phrase is remarkably uncomplicated.
Tamil nouns have no gender distinctions (except where there is biological gender),
no agreement, and adjectives are not inflected for number or gender. The Tamil Nadu
government has spent much time and energy creating lexica and glossaries for various
modern usages for Tamil, but these have mainly generated new nominal terminology,
not verbs. This is partly because LT cannot borrow verbs easily, i.e. it cannot take
a 'foreign' word and add Tamil morphological material to it, such as tense marking
and person-number-gender marking, which all Tamil finite verbs must have. ST, on the
other hand, has no problem with borrowings or other innovative word-formation devices.
Spoken Tamil, however, is not deemed worthy of being used in such contexts. * Sound
Files: Finally, this dictionary, because of its electronic format, contains sound
recordings of every Spoken Tamil example sentence, which of course could not be a
part of any previous dictionaries printed on paper. Since most readers of LT already
possess a familiarity with some form of ST, sound files for verbs in this form are
not provided. The spoken verbs in this dictionary are drawn from the emerging common
dialect of Tamil.
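The lookup path described in the summary, in which a low-frequency English word such as 'pounce' reaches a Tamil verb through the synonym field of another entry, might be sketched as follows. This is only an illustration: the `Entry` structure, its field names, and the placeholder Tamil value are assumptions and do not reflect the dictionary's actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """Hypothetical dictionary entry; all field names are assumptions."""
    headwords: List[str]                      # English main-entry words
    tamil_verb: str                           # placeholder for the Tamil verb
    synonyms: List[str] = field(default_factory=list)


def build_index(entries: List[Entry]) -> Dict[str, List[Entry]]:
    """Map every headword and every synonym to the entries that list it."""
    index: Dict[str, List[Entry]] = {}
    for entry in entries:
        for word in entry.headwords + entry.synonyms:
            index.setdefault(word.lower(), []).append(entry)
    return index


def lookup(index: Dict[str, List[Entry]], english_word: str) -> List[str]:
    """Return the Tamil verbs reachable from an English word."""
    return [e.tamil_verb for e in index.get(english_word.lower(), [])]


# Illustrative data only: the Tamil value is a placeholder, not real content.
entries = [
    Entry(headwords=["jump", "leap"],
          tamil_verb="<Tamil verb for 'jump'>",
          synonyms=["pounce"]),
]
index = build_index(entries)
```

Indexing synonyms alongside headwords means a word without a main entry still resolves, which is the behavior the summary describes for 'pounce'.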
LANGUAGE NOTE
- Language note:
Content in Tamil and English. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Tamil language
- Form subdivision:
Dictionaries.
- General subdivision:
Verb
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Tamil language
- Form subdivision:
Dictionaries
- General subdivision:
English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schiffman, Harold
ADDED ENTRY--PERSONAL NAME
- Personal name:
Renganathan, Vasu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635154
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 874-388-164-791-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Gigaword Fourth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for English Gigaword Fourth Edition, Linguistic Data
Consortium (LDC) catalog number LDC2009T13 and ISBN 1-58563-515-4. English Gigaword,
now being released in its fourth edition, is a comprehensive archive of newswire text
data that has been acquired over several years by the LDC at the University of Pennsylvania.
The fourth edition includes all of the contents in English Gigaword Third Edition
(LDC2007T07) plus new data covering the 24-month period of January 2007 through December
2008. The six distinct international sources of English newswire included in this
edition are the following: * Agence France-Presse, English Service (afp_eng) * Associated
Press Worldstream, English Service (apw_eng) * Central News Agency of Taiwan, English
Service (cna_eng) * Los Angeles Times/Washington Post Newswire Service (ltw_eng) *
New York Times Newswire Service (nyt_eng) * Xinhua News Agency, English Service (xin_eng)
*New in the Fourth Edition* * Articles with significant Spanish language content have
now been identified and documented. * Markup has been simplified and made consistent
throughout the corpus. * Information structure has been simplified. * Character entities
have been simplified.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635200
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 780-971-266-156-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Broadcast Conversation MDE Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Czech Broadcast Conversation MDE Transcripts, Linguistic Data Consortium (LDC) catalog
number LDC2009T20 and ISBN 1-58563-520-0, was prepared by researchers at the University
of West Bohemia, Pilsen, Czech Republic, and consists of approximately 33 hours of
transcribed speech from Radioforum, a talk show broadcast on Czech Radio 1. The audio
files corresponding to the transcripts in this corpus are contained in Czech Broadcast
Conversation Speech (LDC2009S02). These corpora join LDC's other Czech broadcast data
sets: Czech Broadcast News Speech (LDC2004S01), Czech Broadcast News Transcripts (LDC2004T01),
Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89), and Voice of America
(VOA) Czech Broadcast News Transcripts (LDC2000T53). Czech Broadcast Conversation
Speech consists of 72 single channel recordings of Radioforum, a live talk program
broadcast by Czech Radio 1 (CRo1) every weekday evening. A total of 40 hours of recordings
were collected during the period from February 12, 2003 through June 26, 2003. Individual
recordings range from 27 minutes to 36 minutes each. Radioforum's format consists
of invited guests (most often politicians) spontaneously answering topical questions
posed by one or two interviewers. The number of interviewees in a single program varies
from one to three, but typically, one interviewer and two interviewees appear in the
program. The material includes passages of interactive dialogue, but longer stretches
of monologue-like speech comprise the majority of the collected data. Radioforum also
has an interactive segment where listeners call the studio and ask their own questions.
That telephony speech was not transcribed in the current release. Czech Broadcast
Conversation MDE Transcripts was created to extend Metadata Extraction (MDE) research
to conversational Czech. The goal of MDE is to take raw speech recognition output
and refine it into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic transcripts that
are maximally readable. This readability might be achieved in a number of ways: removing
non-content words like filled pauses and discourse markers from the text; removing
sections of disfluent speech; and creating boundaries between natural breakpoints
in the flow of speech so that each sentence or other meaningful unit of speech might
be presented on a separate line within the resulting transcript. Natural capitalization,
punctuation and standardized spelling, plus sensible conventions for representing
speaker turns and identity are further elements in the readable transcript. The transcripts
and annotations in this corpus are stored in three different formats: TRS (Transcriber
- http://trans.sourceforge.net), QAn (Quick Annotator - http://www.mde.zcu.cz/qan.html),
and RTTM. TRS represents a standard speech transcript. QAn and RTTM contain essentially
identical information about structural metadata (MDE); the main difference between
them is formatting. Character encoding in all files is ISO-8859-2. All filenames have
the form rfYYMMDD.format where "rf" stands for Radioforum, the following six digits
indicate the date of broadcast, and the extension ".format" corresponds to the data
format of the particular file: ".trs", ".qan", or ".rttm". More information can be
found on the website Structural Metadata Annotation for Czech. *Data* The radio programs
recorded for this corpus were transcribed with two purposes. First, in order to produce
precise time-aligned verbatim transcripts of the audio recordings, manual transcripts
were created using guidelines based on those employed in Czech Broadcast News Transcripts
(LDC2004T01). Second, the transcripts were annotated with MDE markup to provide structural
information about the conversations. Manual time-aligned verbatim transcription The
original guidelines for time-aligned verbatim transcription used for the Czech broadcast
news data were adjusted to better accommodate specifics of the recorded broadcast
conversation. Those revised guidelines instructed annotators how to deal with the following
phenomena, among others: * Speaker turns: a corresponding time stamp and speaker ID
are inserted every time there is a speaker change in the audio. * Turn-internal breakpoints:
to break up long turns, breakpoints roughly corresponding to 'sentence' boundaries
within a speaker turn are inserted. * Overlapping speech: an overlapping speech region
is recognized when more than one speaker talks simultaneously; within this region,
each speaker's speech is transcribed separately (if intelligible). * Background noises:
[NOISE] tags are used to mark noticeable background noises. * Speaker noises: speaker-produced
noises are identified with one of the following tags: [BREATH], [COUGH], [LAUGH],
[LIP-SMACK]. * Filled pauses: filled pauses produced by a speaker to indicate hesitation
or to maintain control of a conversation are transcribed either as [EE-HESITATION]
or as [MM-HESITATION], based on their pronunciation. * Interjections: certain interjections
typically used as back channels or to express speaker's agreement or disagreement
are transcribed using the [HM] (agreement) and [MH] (disagreement) tags. * Unintelligible
speech: regions of unintelligible speech are marked with a special symbol. * Numbers:
all numerals are transcribed as complete words. * Mispronounced words: mispronounced
words (reading errors, slips of the tongue) are transcribed in the spelling corresponding
to their pronunciation in the audio (i.e., the incorrect pronunciation is represented)
and marked with a special symbol. * Word fragments: the pronounced part of the word
is transcribed and a single dash is used to indicate the point at which the word was broken
off. * Punctuation: standard punctuation (limited to commas, periods, and question
marks) is used to enhance transcript readability. Because the verbatim transcripts
were created by a large number of annotators, they were manually revised for maximum
correctness and consistency. MDE annotation MDE is an annotation task which annotates
Edit Disfluencies (repetitions, revisions, restarts and complex disfluencies), Fillers
(including, e.g., filled pauses and discourse markers) and SUs, or syntactic/semantic
units. Originally, the structural MDE annotation standard was defined for English.
When developing structural metadata annotation guidelines for Czech, the guidelines
developed by LDC for English were followed to the extent possible. Language-dependent
modifications were made based on the description of the syntax of Czech compound and
complex sentences. MDE Annotation marks the following phenomena: * Edit Disfluencies:
Edit disfluencies, or speech repairs, occur when speakers correct or alter their utterances
or abandon them entirely and start over. * Fillers: While the term filler has traditionally
been synonymous with filled pause, SimpleMDE uses the term to encompass a broad set
of vocalized space-fillers: filled pauses (FPs), discourse markers (DMs), explicit
editing terms (EETs) and asides/parentheticals (A/Ps). * Sentence-like units: One
of the goals of MDE annotation is the identification of all units within the discourse
that function to express a complete thought or idea on the part of the speaker. Within
MDE these elements are called SUs (Syntactic, Semantic or Slash Units). Corpus Statistics
The table below contains details about the audio files and the transcripts:
Number of shows                  72
Number of word tokens            292.6k
Number of unique words           30.5k
Duration of transcribed speech   33.0h
Total number of speakers         128
Male speakers                    108
Female speakers                  20
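The file-naming convention and the bracketed noise and hesitation tags described in the summary can be handled with a few lines of Python. The following is a minimal sketch, not part of the corpus tooling: the function names are invented for illustration, the example filename is hypothetical (built from the stated rfYYMMDD pattern), and the choice to drop only noise and hesitation tags while keeping [HM] and [MH] is an assumption.

```python
import re
from datetime import datetime

# Pattern from the summary: "rf" + six-digit broadcast date + format extension.
FILENAME_RE = re.compile(r"rf(\d{6})\.(trs|qan|rttm)")

# Background-noise, speaker-noise and filled-pause tags named in the summary.
NON_LEXICAL_TAGS = (
    "[NOISE]", "[BREATH]", "[COUGH]", "[LAUGH]", "[LIP-SMACK]",
    "[EE-HESITATION]", "[MM-HESITATION]",
)


def parse_filename(name):
    """Return (broadcast_date, format) for a name such as 'rf030212.trs'."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a Radioforum filename: {name!r}")
    # %y maps two-digit years 00-68 to 2000-2068, so "03" becomes 2003.
    broadcast_date = datetime.strptime(match.group(1), "%y%m%d").date()
    return broadcast_date, match.group(2)


def strip_non_lexical_tags(text):
    """Remove noise and hesitation tags and collapse leftover whitespace."""
    for tag in NON_LEXICAL_TAGS:
        text = text.replace(tag, " ")
    return " ".join(text.split())


def read_transcript(path):
    """Read a corpus file using the ISO-8859-2 encoding stated in the summary."""
    with open(path, encoding="iso-8859-2") as handle:
        return handle.read()
```

For example, `parse_filename("rf030212.trs")` yields the date 12 February 2003 and the format `"trs"`; a name that does not follow the convention raises `ValueError`.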
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kolar, Jachym
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635162
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 247-043-830-464-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Tagged Chinese Gigaword Version 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Tagged Chinese Gigaword Version 2.0, created by scholars at Academia Sinica, Taipei,
Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition
(LDC2005T14). Like the original release, Version 2.0 contains all of the data in Chinese
Gigaword Second Edition -- from Central News Agency, Xinhua News Agency and Lianhe
Zaobao -- annotated with full part of speech tags. In addition, this new release removes
residual noises in the original and improves tagging accuracy by incorporating lexica
of unknown words. The changes represented in Version 2.0 include the following: *
A single-width space is used consistently between two segmented words. * The position
of the newline character remains fixed, better reflecting the source files from Chinese
Gigaword Second Edition (LDC2005T14). * The original coding of partial Latin letters
or Arabic numerals is preserved. * 1,192 documents from Central News Agency (Taiwan)
and 13 documents from Xinhua News Agency that were missing from the first publication
are included. * A set of heuristics for building out-of-vocabulary dictionaries to
improve annotation quality of very large corpora is incorporated. Documents in the
corpus were assigned one of the following categories: * story: This type of DOC represents
a coherent report on a particular topic or event, consisting of paragraphs and full
sentences. * multi: This type of DOC contains a series of unrelated "blurbs," each
of which briefly describes a particular topic or event; examples include "summaries
of today's news," "news briefs in ..." (some general area like finance or sports),
and so on. * advis: These are DOCs which the news service addresses to news editors;
they are not intended for publication to the "end users." * other: These DOCs clearly
do not fall into any of the above types; they include items such as lists of sports
scores, stock prices, temperatures around the world, and so on. *Data* Basic statistics
of data from each source are summarized below.
Source    No. Files   Compressed Size (MB)   Total Size (MB)   No. Words (thousands)   No. Documents
CNA_CMN   168         1520                   6136              501456                  1769953
XIN_CMN   168         898                    3755              311660                  992261
ZBN_CMN   10          55                     214               18632                   41418
TOTAL     346         2473                   10105             831748                  2803632
The POS tags and their corresponding explanations are listed below:
Tag    Explanation_Chinese    Explanation_English
A      非謂形容詞              Non-predicative adjective
Caa    對等連接詞,如:和、跟     Conjunctive conjunction
Cab    連接詞,如:等等          Conjunction, e.g. deng3deng3
Cba    連接詞,如:的話          Conjunction, e.g. de5hua4
Cbb    關聯連接詞              Correlative Conjunction
D      副詞                   Adverb
Da     數量副詞                Quantitative Adverb
DE     的, 之, 得, 地          Particle DE and its functional equivalents
Dfa    動詞前程度副詞           Pre-verbal Adverb of degree
Dfb    動詞後程度副詞           Post-verbal Adverb of degree
Di     時態標記                Aspectual Adverb
Dk     句副詞                  Sentential Adverb
FW     外文標記                Foreign Word
I      感嘆詞                  Interjection
Na     普通名詞                Common Noun
Nb     專有名稱                Proper Noun
Nc     地方詞                  Place Noun
Ncd    位置詞                  Localizer
Nd     時間詞                  Time Noun
Nep    指代定詞                Demonstrative Determinatives
Neqa   數量定詞                Quantitative Determinatives
Neqb   後置數量定詞             Post-quantitative Determinatives
Nes    特指定詞                Specific Determinatives
Neu    數詞定詞                Numeral Determinatives
Nf     量詞                   Measure
Ng     後置詞                  Postposition
Nh     代名詞                  Pronoun
P      介詞                   Preposition
SHI    是                     shi4 (to be)
T      語助詞                  Particle
VA     動作不及物動詞           Active Intransitive Verb
VAC    動作使動動詞             Active Causative Verb
VB     動作類及物動詞           Active Pseudo-transitive Verb
VC     動作及物動詞             Active Transitive Verb
VCL    動作接地方賓語動詞        Active Verb with a Locative Object
VD     雙賓動詞                Ditransitive Verb
VE     動作句賓動詞             Active Verb with a Sentential Object
VF     動作謂賓動詞             Active Verb with a Verbal Object
VG     分類動詞                Classificatory Verb
VH     狀態不及物動詞           Stative Intransitive Verb
VHC    狀態使動動詞             Stative Causative Verb
VI     狀態類及物動詞           Stative Pseudo-transitive Verb
VJ     狀態及物動詞             Stative Transitive Verb
VK     狀態句賓動詞             Stative Verb with a Sentential Object
VL     狀態謂賓動詞             Stative Verb with a Verbal Object
V_2    有                     you3 (to have)
Since neither manual checking nor automatic checking against a gold standard is feasible
for gigaword size corpora, the authors proposed quality assurance of automatic annotation
of very large corpora based on heterogeneous CKIP and ICTCLAS tagging systems (Huang
et al., 2008). By comparing against word lists generated from the ICTCLAS version of an
automatically tagged Xinhua portion of Chinese Gigaword, the authors proposed a set of
heuristics for building out-of-vocabulary dictionaries to improve quality. Randomly selected
texts for evaluating effects of these out-of-vocabulary dictionaries were manually
checked. Experimental results indicate that 30,562 of the tested words (about 97.3%)
were correct. The quality control test result follows:
Corpora   Thousands of words   No. Test words   No. Correct Words
CNA       501459               42,695           41,449
XIN       311718               28,744           27,967
ZBN       18632                22,825           22,270
Total     831809               31,421           30,562
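The accuracy figure quoted in the summary can be reproduced from the quality control table with a short calculation. The sketch below copies the test-word and correct-word counts from that table; the variable names are illustrative. The counts in the Total row appear to be close to per-source averages rather than sums, so the code computes both a pooled accuracy over the three sources and the accuracy implied by the Total row as printed.

```python
# (test words, correct words) per source, copied from the quality control table.
results = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
}
reported_total = (31421, 30562)  # "Total" row as printed in the table


def accuracy(tested, correct):
    """Fraction of tested words judged correct."""
    return correct / tested


per_source = {name: accuracy(t, c) for name, (t, c) in results.items()}

# Pooled accuracy: all correct words divided by all tested words.
pooled = accuracy(
    sum(t for t, _ in results.values()),
    sum(c for _, c in results.values()),
)
reported = accuracy(*reported_total)

for name, value in per_source.items():
    print(f"{name}: {value:.1%}")
print(f"pooled: {pooled:.1%}, total row: {reported:.1%}")
```

Both the pooled figure and the Total row round to 97.3%, matching the value stated in the summary.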
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Chu-Ren
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635170
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 219-593-002-727-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 was prepared by LDC and contains
240,000 characters (112 files) of Chinese newsgroup text and its translation selected
from twenty-five sources. Newsgroups consist of posts to electronic bulletin boards,
Usenet newsgroups, discussion groups and similar forums. This release was used as
training data in Phase 1 (year 1) of the DARPA-funded GALE program. *Source Data*
Preparing the source data involved four stages of work: data scouting, data harvesting,
formatting, and data selection. Data scouting involved manually searching the web for
suitable newsgroup text. Data scouts were assigned particular topics and genres along
with a production target in order to focus their web search. Formal annotation guidelines
and a customized annotation toolkit helped data scouts to manage the search process
and to track progress. Data scouts logged their decisions about potential text of
interest (sites, threads and posts) to a database. A nightly process queried the annotation
database and harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data scout. Once
the text was downloaded, its format was standardized (by running various scripts)
so that the data could be more easily integrated into downstream annotation processes.
Original-format versions of each document were also preserved. Typically, a new script
was required for each new domain name that was identified. After scripts were run,
an optional manual process corrected any remaining formatting problems. The selected
documents were then reviewed for content-suitability using a semi-automatic process.
A statistical approach was used to rank a document's relevance to a set of already-selected
documents labeled as "good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular annotation task
or for annotation in general. These newly-judged documents in turn provided additional
input for the generation of new ranked lists. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich Transcription specification.
Three types of end of sentence SU were identified: statement SU, question SU and incomplete
SU. *Translation* After files were selected, they were reformatted into a human-readable
translation format, and the files were then assigned to professional translators for
careful translation. Translators followed GALE Translation guidelines which describe
the makeup of the translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as names and speech
disfluencies), and quality control procedures applied to completed translations. TDF
Format All final data are in Tab Delimited Format (TDF). TDF is compatible with other
transcription formats, such as the Transcriber format and AG format, and it is easy
to process. Each line of a TDF file corresponds to a speech segment and contains 13
tab delimited fields:
field                 data_type
1   file              unicode
2   channel           int
3   start             float
4   end               float
5   speaker           unicode
6   speakerType       unicode
7   speakerDialect    unicode
8   transcript        unicode
9   section           int
10  turn              int
11  segment           int
12  sectionType       unicode
13  suType            unicode
A source TDF file and its translation are the same except that the transcript
in the source TDF is replaced by its English translation. Some fields are inapplicable
to newsgroup text. Those include the channel, start time, end time and speaker dialect
fields. These fields are either empty or contain values as a placeholder. Encoding
All data are encoded in UTF-8. *Sponsorship* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
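A single TDF line, as described in the summary, can be split into its 13 tab-delimited fields and converted to the listed data types. The following is a minimal sketch under stated assumptions: the `Segment` type and function names are invented for illustration, empty numeric fields are mapped to `None`, and any placeholder values in the inapplicable fields are assumed to be numeric. Header or comment lines, if a file contains them, are not handled here.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    """One TDF line; field names and types follow the list in the summary."""
    file: str
    channel: Optional[int]
    start: Optional[float]
    end: Optional[float]
    speaker: str
    speakerType: str
    speakerDialect: str
    transcript: str
    section: Optional[int]
    turn: Optional[int]
    segment: Optional[int]
    sectionType: str
    suType: str


def _optional(convert, value):
    """Convert a field, returning None when it is empty."""
    value = value.strip()
    return convert(value) if value else None


def parse_tdf_line(line):
    """Split one tab-delimited line into a typed Segment."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, found {len(fields)}")
    (file, channel, start, end, speaker, speaker_type, dialect,
     transcript, section, turn, segment, section_type, su_type) = fields
    return Segment(
        file,
        _optional(int, channel),
        _optional(float, start),
        _optional(float, end),
        speaker, speaker_type, dialect, transcript,
        _optional(int, section),
        _optional(int, turn),
        _optional(int, segment),
        section_type, su_type,
    )
```

When reading a whole file, opening it with `encoding="utf-8"` matches the encoding stated in the summary; because a source file and its translation differ only in the transcript field, the same parser should apply to both.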
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635189
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 202-219-770-615-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Spanish Gigaword Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Spanish Gigaword Second Edition is a comprehensive archive of newswire text data that
has been acquired over several years by LDC. This second edition updates Spanish Gigaword
First Edition (LDC2006T12) and adds data collected from January 1, 2006 through December
31, 2008. The three distinct international sources of Spanish newswire in this edition,
and the time spans of collection covered for each, are as follows:
* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008
The seven-character codes in the parentheses above include the three-character
source name abbreviations and the three-character language code (spa) separated by
an underscore (_) character. The three-letter language code conforms to LDC's internal
convention based on the ISO 639-3 standard. These codes are used in the directory
names where the data files are found and in the prefix that appears at the beginning
of every data file name. They are also used (in all UPPER CASE) as the initial portion
of the DOC id strings that uniquely identify each news story. *Data* The overall totals
for each source are summarized below. Note that:
* the Totl-MB numbers show the amount of data obtained when the files are uncompressed
(i.e. approximately 7 gigabytes, total);
* the Gzip-MB column shows totals for compressed file sizes as stored on the DVD-ROM;
* the K-wrds numbers are simply the number of whitespace-separated tokens (of all types)
after all SGML tags are eliminated.
Source   #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFP_SPA     175     1182     3512   506562  1748787
APW_SPA     180      886     2721   402718  1244811
XIN_SPA      88      405     1238   182543   734356
TOTAL       443     2453     7471  1091823  3727954
The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC
type. Text-MB represents the total number of characters (including whitespace) after
SGML tags are eliminated.
             Text-MB   K-wrds    #DOCs
type=advis:
AFP_SPA          144    20520    45446
APW_SPA           41     6173    11112
XIN_SPA            0        0        0
TOTAL            185    26693    56558
type=multi:
AFP_SPA           84    12711    15346
APW_SPA          351    55758   107224
XIN_SPA          189    29970    56372
TOTAL            624    98439   178942
type=other:
AFP_SPA          275    38665   160815
APW_SPA          296    40517   162448
XIN_SPA           44     6376    50168
TOTAL            615    85558   373431
type=story:
AFP_SPA         2771   434677  1527180
APW_SPA         1875   300274   964027
XIN_SPA          911   146199   627816
TOTAL           5557   881150  3119023
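The naming convention and the K-wrds definition above lend themselves to a short illustration. The sketch below is an assumption-laden example rather than LDC's own tooling: the helper names are hypothetical, the record states only that the upper-case code forms the initial portion of a DOC id (so only that prefix is checked), and tag removal is approximated with a simple regular expression.

```python
import re

# Source codes and names as listed in the summary above.
SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_source_code(code):
    """Split a code such as 'afp_spa' into (source, language)."""
    source, language = code.split("_")
    return source, language


def source_of_doc_id(doc_id):
    """Return the lower-case source code that begins a DOC id, or None."""
    for code in SOURCES:
        if doc_id.startswith(code.upper()):
            return code
    return None


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    # Assumption: every tag has the form <...> with no '>' inside it.
    text = re.sub(r"<[^>]*>", " ", sgml_text)
    return len(text.split())
```

Replacing each tag with a space keeps words on either side of a tag from being fused into one token; a character count in the style of Text-MB would need a different substitution.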
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mendonça, Ângelo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635219
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 677-375-027-082-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Newswire English Translation Collection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Arabic English Newswire Translation Collection was produced by the Linguistic
Data Consortium (LDC). It consists of approximately 550,000 words of Arabic newswire
text and its English translation from Agence France Presse (France), An Nahar (Lebanon)
and Assabah (Tunisia). The source Arabic text was used in LDC's Arabic Treebank, specifically,
in Part 1 (Part 1 v. 2.0; Part 1 v. 3.0), Part 3 (Part 3 v. 1.0; Part 3 v. 2.0) and
Part 4 (Part 4 v. 1.0). A subset of Agence France Presse (AFP) source text from Arabic
Treebank: Part 1 v. 2.0 was previously translated and released by LDC in Arabic Treebank:
Part 1 - 10K-word English Translation, LDC2003T07. The English translations in this
corpus were provided by translation agencies using LDC's Arabic Translation Guidelines.
*Data* The number of stories and their epochs for each source are as follows:
AFP       734 stories; July 2000 - November 2000
An Nahar  600 stories; January 2002 - December 2002
Assabah   397 stories; September 2004 - November 2004
Total     1731 stories
Word count of Arabic tokens by source is shown in the following table:
AFP       102,564
An Nahar  299,681
Assabah   149,259
Total     551,504
The original source files used different encodings for
the Arabic characters, including UTF8 and ASMO. SGML tags were used for marking sentence
and paragraph boundaries and for annotating other information about each story. All
Arabic source data was converted to UTF and most SGML tags were removed or replaced
by "plain text" markers.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635227
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 293-001-569-603-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
FactBank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN
1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast
news reports in which event mentions are annotated with their degree of factuality,
that is, the degree to which they correspond to those events. FactBank 1.0 was built
on top of TimeBank 1.2 and a fragment of the AQUAINT TimeML Corpus, both of which
used the TimeML specification language. This resulted in a double-layered annotation
of event factuality. TimeBank 1.2 and AQUAINT TimeML encode most of the basic structural
elements expressing factuality information while FactBank 1.0 represents the resulting
factuality interpretation. The combination of the factuality values in FactBank with
the structural information in TimeML-annotated corpora facilitates the development
of tools aimed at automatically identifying the factuality values of events, a component
fundamental in tasks requiring some degree of text understanding, such as Textual
Entailment, Question Answering, or Narrative Understanding. FactBank annotations indicate
whether the event mention describes actual situations in the world, situations that
have not happened, or situations of uncertain interpretation. Event factuality is
not an inherent feature of events but a matter of perspective. Different discourse
participants may present divergent views about the factuality of the very same event.
Consequently, in FactBank, the factuality degree of events is assigned relative to
the relevant sources at play. In this way, it can adequately reflect the divergence
of opinions regarding the factual status of events, as is common in news reports.
The annotation language is grounded on established linguistic analyses of the phenomenon,
which facilitated the creation of a battery of discriminatory tests for distinguishing
between factuality values. Furthermore, the annotation procedure was carefully designed
and divided into basic, sequential annotation tasks. This made it possible for hard
tasks to be built on top of simpler ones, while at the same time allowing annotators
to become incrementally familiar with the complexity of the problem. As a result,
FactBank annotation achieved a relatively high interannotator agreement, kappa=0.81,
a positive result when considered against similar annotation efforts. *Data* All FactBank
markup is standoff and is represented through a set of 20 tables which can be easily
loaded into a database. Each table resides in an independent text file, where fields
are separated by three consecutive bars (i.e., |||). The data in fields of string
type are enclosed in single quotation marks ('). Because FactBank 1.0 was built on
top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up with inline XML-based
annotation, this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in
standoff, table-based format as well.
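A row in this table format can be split with ordinary string operations. The following is a minimal sketch under stated assumptions: the function name is hypothetical, unquoted fields are assumed to be integers where they parse as such, and no handling is attempted for quotation marks or bars embedded inside a string value, since the record does not say how those are escaped.

```python
def parse_factbank_row(line):
    """Split one '|||'-separated row and unquote its string fields."""
    row = []
    for field in line.rstrip("\r\n").split("|||"):
        if len(field) >= 2 and field.startswith("'") and field.endswith("'"):
            # String-typed field: drop the surrounding single quotes.
            row.append(field[1:-1])
        else:
            # Assumption: unquoted fields are numeric when they parse as int.
            try:
                row.append(int(field))
            except ValueError:
                row.append(field)
    return row
```

Rows parsed this way can be passed to a database insert statement, which matches the stated intent that the tables be easy to load into a database.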
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Journalism
- General subdivision:
Objectivity
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Facts (Philosophy)
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Grammar, Comparative and general
- General subdivision:
Parsing
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pustejovsky, James
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635235
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 644-574-573-711-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CSLU: S4X Release 1.2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CSLU: S4X Release 1.2, Linguistic Data Consortium (LDC) catalog number LDC2009S03
and ISBN 1-58563-523-5, was created by the Center for Spoken Language Understanding,
Oregon Health and Science University (CSLU). The corpus consists of 36 speakers (22
male, 14 female) uttering 11 specified words. The speakers repeated the following
words six times on each of four channels: startrek, supernova, tektronix, generation,
nebula, processing, singularity, 71523, abracadabra, sungeeta and computer. The four
channels used were office phone, home phone, carbon microphone telephone and speaker
phone. Each speech file has a corresponding time-aligned phoneme-level transcription
(achieved using automatic forced alignment) and an automatically-generated word-level
transcription. Humans reviewed each utterance in two passes and classified it as good,
bad, noisy or different. The results of this verification process are included in
the /docs directory. *Data* The data was recorded with the CSLU T1 digital data collection
system. Each utterance is recorded as a separate file. These files were sampled at
8 kHz, 8-bit and stored as ulaw files. All of the data use the RIFF standard file format.
This file format is 16-bit linearly encoded.
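The record mentions both 8-bit ulaw sampling and 16-bit linear encoding. Should a file turn out to hold mu-law bytes, they can be expanded to 16-bit linear samples as sketched below. This assumes standard G.711 mu-law companding, which the record does not confirm; the function names are hypothetical.

```python
import struct

BIAS = 0x84  # bias constant used by G.711 mu-law companding


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                      # mu-law codes are stored inverted
    magnitude = (((u & 0x0F) << 3) + BIAS) << ((u & 0x70) >> 4)
    return BIAS - magnitude if u & 0x80 else magnitude - BIAS


def decode_ulaw(data):
    """Decode a bytes object of mu-law codes into a list of samples."""
    return [ulaw_to_linear(b) for b in data]


def to_pcm16_bytes(samples):
    """Pack samples as little-endian signed 16-bit PCM."""
    return struct.pack("<%dh" % len(samples), *samples)


def duration_seconds(n_samples, rate=8000):
    """Duration of n_samples at the 8 kHz rate stated in the record."""
    return n_samples / rate
```

The decoder is written out by hand so the sketch does not depend on any audio module being available in a given Python version.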
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Ronald Allan
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lander, T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Durham, T.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635243
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 591-792-796-939-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OntoNotes Release 3.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The OntoNotes project is a collaborative effort between BBN Technologies, the University
of Colorado, the University of Pennsylvania, and the University of Southern California's
Information Sciences Institute. The goal of the project is to annotate a large corpus
comprising various genres of text (news, conversational telephone speech, weblogs,
Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic)
with structural information (syntax and predicate argument structure) and shallow
semantics (word sense linked to an ontology and coreference). OntoNotes Release 3.0
is a continuation of the OntoNotes project and is supported by the Defense Advanced
Research Projects Agency, GALE Program Contract No. HR0011-06-C-0022. OntoNotes Release
1.0 (LDC2007T21) contains 400k words of Chinese newswire data (from Xinhua News Agency
and Sinorama Magazine) and 300k words of English newswire data (from the Wall Street
Journal). OntoNotes Release 2.0 (LDC2008T04) added the following to the corpus: 274k
words of Chinese broadcast news data (from China Broadcasting System, China Central
TV, China National Radio, China Television System and Voice of America); and 200k
words of English broadcast news data (from ABC, CNN, NBC, Public Radio International
and Voice of America). OntoNotes Release 3.0 incorporates the following new material:
250k words of English newswire data (from the Wall Street Journal and Xinhua News
Agency); 200k words of English broadcast news data (from ABC, CNN, NBC, Public Radio International
and Voice of America); 200k words of English broadcast conversation material (translated
from China Central TV and Phoenix TV); 250k words of Chinese newswire data (from Xinhua
News Agency and Sinorama Magazine); 250k words of Chinese broadcast news material
(from China Broadcasting System, China Central TV, China National Radio, China Television
System and Voice of America); 150k words of Chinese broadcast conversation data (from
China Central TV and Phoenix TV); and 200k words of Arabic newswire material (from
An Nahar). Natural language applications like machine translation, question answering
and summarization currently are forced to depend on impoverished text models like
bags of words or n-grams, while the decisions that they are making ought to be based
on the meanings of those words in context. That lack of semantics causes problems
throughout the applications. Misinterpreting the meaning of an ambiguous word results
in failing to extract data, incorrect alignments for translation, and ambiguous language
models. Incorrect coreference resolution results in missed information (because a
connection is not made) or incorrectly conflated information (due to false connections).
OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax
and the Penn PropBank for predicate-argument structure. Its semantic representation
will include word sense disambiguation for nouns and verbs, with each word sense connected
to an ontology, and coreference. The current goals call for annotation of over a million
words each of English and Chinese, and half a million words of Arabic over five years.
*Data* Each data directory has been stored as a gzip-compressed tar file (.tgz) due to
the complexity and depth of each directory and the limitations of the ISO 9660 file
system for CD and DVD media. These directories may be easily unpacked using the Unix
command line or using utilities such as StuffIt or WinZip under Windows.
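Besides the command-line and desktop utilities mentioned above, a .tgz archive can be unpacked from Python's standard library. This is a minimal sketch: the function name and any archive path passed to it are hypothetical, and the `data` extraction filter is applied only where the installed Python provides it.

```python
import tarfile
from pathlib import Path


def unpack_tgz(archive, dest):
    """Extract a gzip-compressed tar archive and list the files it held."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as
            # absolute paths or entries escaping the destination.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    # Return the extracted file paths relative to the destination.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

Extracting to a local disk rather than reading from the optical medium avoids the directory-depth limits the record cites as the reason for packaging the data this way.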
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, Chinese, and Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaufman, Jeff
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franchini, Michelle
ADDED ENTRY--PERSONAL NAME
- Personal name:
El-Bachouti, Mohammed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Greenberg, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hovy, Eduard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Belvin, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Houston, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u swe d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635251
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 930-499-840-946-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rum
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
dut
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
ron
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
nld
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Web 1T 5-gram, 10 European Languages Version 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists
of word n-grams and their observed frequency counts for ten European languages: Czech,
Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish.
The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram
counts were generated from approximately one hundred billion word tokens of text for
each language, or approximately one trillion total tokens. The n-grams were extracted
from publicly-accessible web pages from October 2008 to December 2008. This data set
contains only n-grams that appeared at least 40 times in the processed sentences.
Less frequent n-grams were discarded. While the aim was to identify and collect pages
from the specific target languages only, it is likely that some text from other languages
may be in the final data. This dataset will be useful for statistical language modeling,
including machine translation, speech recognition and other uses. *Data* The input
encoding of documents was automatically detected, and all text was converted to UTF8.
The following table contains statistics for the entire release.
File sizes (entire corpus): approximately 27.9 GB compressed (bzip2) text files
Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of fourgrams: 1,396,154,236
Total number of fivegrams: 1,149,361,413
Total number of n-grams: 4,600,926,713
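The counting scheme described above (n-grams of length one to five, kept only if seen at least 40 times) can be illustrated on a small scale. The sketch below is not Google's pipeline: the function name is hypothetical, and whitespace splitting stands in for whatever tokenization was actually used, which the record does not describe.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5, min_count=40):
    """Count 1..max_n-grams per sentence and drop those below min_count."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only n-grams at or above the frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

N-grams are counted within sentence boundaries only, which is consistent with the record's reference to counts taken over the processed sentences.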
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French,
German, and Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
- General subdivision:
Statistical methods
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brants, Thorsten
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franz, Alex
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 922-902-627-783-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NXT Switchboard Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NXT Switchboard Annotations brings together in NITE XML, a single XML format, the
multiple layers of annotation performed on a transcript subset from Switchboard-1
Release 2, LDC97S62. NXT Switchboard Annotations was developed in a collaboration
among researchers from Edinburgh University, Stanford University and the University
of Washington. The original Switchboard corpus is a collection of spontaneous telephone
conversations between previously unacquainted speakers of American English on a variety
of topics chosen from a pre-determined list. A subset of one million words from those
conversations was annotated for syntactic structure and disfluencies as part of the
Penn Treebank project. Phonetic transcripts were generated by the International Computer
Science Institute, University of California Berkeley and later corrected by the Institute
for Signal Information Processing, Mississippi State University. The Penn Treebank
transcripts provided the basis for the NXT Switchboard corpus, and the noun phrases
from that subset were annotated for animacy. The Treebank transcript was then aligned
with the corresponding subset from the corrected Mississippi State (MS-State) transcript
in order to provide word timing information. Focus/contrast and prosodic annotations,
as well as phone/syllable alignment were next added to the annotations. The previous
annotations of dialog acts and prosody were converted to NITE XML. Lastly, hand annotations
for markables were added to provide information about their animacy and information
structure, including coreferential links. *NXT Annotation* NXT is an open source toolkit
that enables multiple linguistic annotations to be assembled into a unified database.
It uses a stand-off XML data format that consists of several XML files that point
to each other. The NXT format provides a data model that describes how the various
annotations for a corpus relate to one another. For that reason, it does not impose
any particular linguistic theory or any particular markup structure. Instead, users
define their annotations in a "metadata" file that expresses their contents and how
they relate to each other in terms of the graph structure for the corpus annotations
overall. The relationships that can be defined in the data model draw annotations
together into a set of intersecting trees, but also allow arbitrary links between
annotations over the top of this structure, giving a representation that is highly
expressive, easier to process than arbitrary graphs and structured in a way that helps
data users. NXT's other core component is a query language designed specifically for
working with data conforming to this data model. Together, the data model and query
language allow annotations to be treated as one coherent set containing both structural
and timing information. The data in NXT Switchboard Annotations was converted from
the Penn Treebank bracketed format in which the Switchboard corpus was originally
distributed using an XML-based tool for syntactic query that comes with a ready-made
Switchboard converter. Conversion was performed using a set of XSL stylesheets to
extract each of the multiple XML files associated with one dialogue. The data was
divided into separate XML files representing the orthographic transcription, syntax,
turn structure, disfluencies and movement, or the relationship between traces and
their sources. Transcription consists of a flat list of terminals: words, punctuation,
traces, and so on. Syntax starts with a flat list of parses and works down through
nonterminals, grounding in terminals (which are in the transcription file, but are
referenced by pointers that indicate they are to be treated as if they were part of
the tree itself). Turn structure is simply a flat list of turns that themselves contain
parses as children, again via pointers into the syntax file. Yet another file couples
reparanda and repairs into disfluencies by pointing to the appropriate nonterminals
using named roles. A movement file similarly links sources with their target traces.
While this representation may seem awkward, it has advantages over the original arrangement.
First, it places the information in a single tree structure, with co-indexing for
the crossing links that are sometimes required for disfluency and movement. Secondly,
it facilitates querying the crossing structures, since they are treated on a par with
other structures within the data. Although this ease is not particularly important
for the initial, syntactic data, it is crucial for a correct understanding of discourse
phenomena such as coreference. Third, separating the tags into their various types
makes it easier to add data using external processes (part-of-speech taggers, named
entity recognizers, and the like). Fourth, different people can change different data
files at the same time without conflict, as long as neither edits the files they point
to and both are able to lock complete paths of files pointing to the data they are
revising. Last, a data set can be loaded in whole or in part, speeding up some processing.
The NITE XML Toolkit itself treats the data seamlessly no matter whether it is in
one file or many. *Licensing* This corpus is made available to LDC not-for-profit
members and all nonmembers under the Creative Commons Attribution-Noncommercial Share
Alike 3.0 license. NXT Switchboard Annotations is available to LDC's for-profit members
under the terms of their For-Profit Membership Agreements.
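The stand-off arrangement described above, in which a syntax file refers to words held in a separate transcription file, might be pictured with the following minimal sketch. The element names, attribute names and href syntax here are hypothetical simplifications invented for this example; they are not the actual NITE XML schema, and the NITE XML Toolkit itself is not used.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription file: a flat list of terminals.
TERMINALS_XML = """<terminals>
  <word id="t1" text="I"/>
  <word id="t2" text="see"/>
</terminals>"""

# Hypothetical syntax file: nonterminals point at terminals by file and id.
SYNTAX_XML = """<syntax>
  <nt id="n1" cat="S">
    <child href="terminals.xml#t1"/>
    <child href="terminals.xml#t2"/>
  </nt>
</syntax>"""


def resolve_words(syntax_xml, files):
    """Map each nonterminal id to the word texts its pointers refer to.

    `files` maps a file name to its XML text (a stand-in for reading
    the separate files from disk).
    """
    # Build an id index for every referenced file.
    index = {
        name: {el.get("id"): el for el in ET.fromstring(text).iter() if el.get("id")}
        for name, text in files.items()
    }
    resolved = {}
    for nt in ET.fromstring(syntax_xml).iter("nt"):
        words = []
        for child in nt.findall("child"):
            file_name, target_id = child.get("href").split("#", 1)
            words.append(index[file_name][target_id].get("text"))
        resolved[nt.get("id")] = words
    return resolved


if __name__ == "__main__":
    print(resolve_words(SYNTAX_XML, {"terminals.xml": TERMINALS_XML}))
```

Because the words live only in the transcription file, a second annotation layer could point at the same ids without either file being edited, which is the advantage the summary attributes to the split representation.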
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Calhoun, Sasha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Carletta, Jean
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jurafsky, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nissim, Malvina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ostendorf, Mari
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zaenen, Annie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635278
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 261-416-300-929-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Gigaword Fourth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T27
and ISBN 1-58563-527-8, is a comprehensive archive of newswire text data that has
been acquired over several years by the LDC. This edition includes all of the contents
in Chinese Gigaword Third Edition (LDC2007T38) as well as newly collected data. In
addition, four entirely new sources have been added in the fourth edition, Central
News Service, Guangming Daily, People's Liberation Army Daily, and People's Daily. The
eight distinct international sources of Chinese newswire included in this edition
are the following: * Agence France Presse (afp_cmn) * Central News Agency, Taiwan
(cna_cmn) * Central News Service (cns_cmn) * Guangming Daily (gmw_cmn) * People's Daily
(pda_cmn) * People's Liberation Army Daily (pla_cmn) * Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn) The seven-letter codes in the parentheses above are used
for the directory names and data files for each source, and are also used (in ALL_CAPS)
as part of the unique DOC id string assigned to each news article. *Data* The original
data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao
were encoded in GB-2312, those from CNA were in Big-5, and those from GMW, CNS, and
People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems
and confusion that could result from differences in character-set specifications,
all text files in this corpus have been converted to UTF-8 character encoding. *New
in the Fourth Edition* * Two years' worth of new articles (January 2007 through December
2008) have been added to the Xinhua, Agence France Presse, and CNA data sets. * Four
new data sources have been added - Guangming Daily, Central News Service, People's
Daily and People's Liberation Army Daily, covering a timespan from November 2006 through
December 2008.
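The character-set conversion described in the summary above could be sketched roughly as below. The mapping from source code to Python codec name is an assumption based only on the encodings named in the summary; for the sources delivered in a mix of GB-2312 and GB-18030, the sketch decodes with gb18030, which is a superset of GB-2312. The function name to_utf8 is hypothetical and this is not LDC's actual conversion tool.

```python
# Assumed codec per source directory code, following the summary text.
SOURCE_ENCODINGS = {
    "afp_cmn": "gb2312",
    "pla_cmn": "gb2312",
    "xin_cmn": "gb2312",
    "zbn_cmn": "gb2312",
    "cna_cmn": "big5",
    # Mixed GB-2312 / GB-18030 input: gb18030 can decode both.
    "gmw_cmn": "gb18030",
    "cns_cmn": "gb18030",
    "pda_cmn": "gb18030",
}


def to_utf8(raw: bytes, source: str) -> bytes:
    """Re-encode raw article bytes from the source's legacy encoding to UTF-8."""
    text = raw.decode(SOURCE_ENCODINGS[source])
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = "新闻".encode("gb2312")
    print(to_utf8(legacy, "xin_cmn").decode("utf-8"))
```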
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635286
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 739-169-067-045-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
French Gigaword Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
French Gigaword Second Edition is a comprehensive archive of newswire text data that
has been acquired over several years by LDC. This second edition updates French Gigaword
First Edition (LDC2006T17) and adds material collected from August 1, 2006 through
December 31, 2008. The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as follows: * Agence
France-Presse (afp_fre) May 1994 - Dec 2008 * Associated Press Worldstream, French
(apw_fre) Nov 1994 - Dec 2008 The seven-letter codes in parentheses include the three-character
source name abbreviations and the three-character language code (fre) separated by
an underscore (_) character. The three-letter language code conforms to LDC's internal
convention based on the ISO 639-3 standard. These codes are used in the directory
names where the data files are found and in the prefix that appears at the beginning
of every data file name. They are also used (in all UPPER CASE) as the initial portion
of the DOC id strings that uniquely identify each news story. *Data* The overall totals
for each source are summarized below. The Totl-MB numbers show the amount of data
obtained when the files are uncompressed (i.e., approximately 15 gigabytes, total),
the Gzip-MB column shows totals for compressed file sizes as stored on the DVD-ROM,
and the K-wrds numbers are the number of whitespace-separated tokens (of all types)
after all SGML tags are eliminated.
Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE     172     2408     4079  560000  2060803
APW_FRE     171     2280     1719  241324  0872573
TOTAL       343     4688     5789  801324  2933376
The following tables present Text-MB, K-wrds and #DOCS broken down
by source and DOC type. Text-MB represents the total number of characters (including
whitespace) after SGML tags are eliminated.
Source   Text-MB  K-wrds  #DOCs
type=advis:
AFP_FRE       88   11788  48712
APW_FRE       14    2303   9235
TOTAL        103   14091  57947
type=multi:
AFP_FRE       59    8411  10269
APW_FRE      194   29828  52240
TOTAL        253   38239  62509
type=other:
AFP_FRE      178   58514   8411
APW_FRE       82  193981  29828
TOTAL        260   38239  38239
type=story:
AFP_FRE     1824  198440  27216
APW_FRE      729   87662  13006
TOTAL       2553  286102  40222
The data has undergone
a consistent extent of quality control to eliminate out-of-band content and other
obvious forms of corruption. Since the source data is generated manually on a daily
basis, there will be a small percentage of human errors common to all sources: missing
whitespace, incorrect or variant spellings, badly formed sentences, and so on, as
are normally seen in newspapers. No attempt has been made to address this property
of the data.
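The K-wrds measure described above (whitespace-separated tokens counted after SGML tags are removed) and the use of the source code as the start of each DOC id might be illustrated with the short sketch below. The sample document, its DOC id value and the function names are hypothetical, and the regular expression is a simplification that assumes well-formed tags rather than a full SGML parser.

```python
import re

# Simplified tag pattern: anything between '<' and the next '>'.
TAG = re.compile(r"<[^>]+>")

# Hypothetical sample document, invented for this example.
SAMPLE_DOC = """<DOC id="AFP_FRE_19940512.0001" type="story">
<TEXT>
<P>
Le président a parlé.
</P>
</TEXT>
</DOC>"""


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens once SGML tags are stripped."""
    return len(TAG.sub(" ", sgml_text).split())


def source_code(doc_id: str) -> str:
    """Return the lower-case seven-character source code that starts a DOC id."""
    return doc_id[:7].lower()


if __name__ == "__main__":
    print(count_tokens(SAMPLE_DOC), source_code("AFP_FRE_19940512.0001"))
```

A K-wrds figure for a source would then be the sum of such token counts divided by 1000.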
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in French. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
French language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mendonça, Ângelo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635294
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 994-591-828-190-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2007 NIST Language Recognition Evaluation Test Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2007 NIST Language Recognition Evaluation Test Set consists of 66 hours of conversational
telephone speech segments in the following languages and dialects: Arabic, Bengali,
Chinese (Cantonese), Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American,
Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian, Spanish (Caribbean,
non-Caribbean), Tamil, Thai and Vietnamese. The goal of the NIST (National Institute
of Standards and Technology) Language Recognition Evaluation (LRE) is to establish
the baseline of current performance capability for language recognition of conversational
telephone speech and to lay the groundwork for further research efforts in the field.
NIST conducted three previous language recognition evaluations, in 1996, 2003 and
2005. The most significant differences between those evaluations and the 2007 task
were the increased number of languages and dialects, the greater emphasis on a basic
detection task for evaluation and the variety of evaluation conditions. Thus, in 2007,
given a segment of speech and a language of interest to be detected (i.e., a target
language), the task was to decide whether that target language was in fact spoken
in the given telephone speech segment (yes or no), based on an automated analysis
of the data contained in the segment. Further information regarding this evaluation
can be found in the evaluation plan which is included in the documentation for this
release. The training data for LRE 2007 consists of the following: * 2003 NIST Language
Recognition Evaluation, LDC2006S31. This material is comprised of: (1) approximately
46 hours of conversational telephone speech segments in the target languages and dialects
and (2) the 1996 LRE test data (conversational telephone speech in Arabic (Egyptian
colloquial), English (General American, Southern American), Farsi, French, German,
Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Caribbean,
non-Caribbean), Tamil and Vietnamese). * 2005 NIST Language Recognition Evaluation,
LDC2008S05. This release consists of approximately 44 hours of conversational telephone
speech in English (American, Indian), Hindi, Japanese, Korean, Mandarin Chinese (Mainland,
Taiwan), Spanish (Mexican) and Tamil. * Supplemental test data to be released by LDC
in late 2009, 2007 NIST Language Recognition Evaluation Supplemental Training Data,
LDC2009S05. *Data* Each speech file in the test data is one side of a 4-wire telephone
conversation represented as 8-bit 8-kHz mu-law format. There are 7530 speech files
in SPHERE (.sph) format for a total of 66 hours of speech. The speech data was compiled
from LDC's CALLFRIEND, Fisher Spanish and Mixer 3 corpora and from data collected by
Oregon Health and Science University, Beaverton, Oregon. The test segments contain
three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech
durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13
seconds and 23-35 seconds, respectively. Non-speech portions of each segment were
included in each segment so that a segment contained a continuous sample of the source
recording. Therefore, the test segments may be significantly longer than the speech
duration, depending on how much non-speech was included. Unlike previous evaluations,
the nominal duration for each test segment was not identified.
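For readers unfamiliar with the 8-bit 8-kHz mu-law representation mentioned above, the following sketch expands a mu-law byte to a 16-bit linear sample using the standard G.711 expansion, and derives a duration from a sample count. It assumes the raw sample bytes have already been separated from the SPHERE header, which is not parsed here; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # 8-kHz sampling, one byte per sample


def mulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF            # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def duration_seconds(num_samples: int) -> float:
    """Duration of a single-channel segment at 8 kHz."""
    return num_samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    payload = bytes([0xFF, 0x80, 0x00])  # toy payload, not real corpus data
    print([mulaw_to_linear(b) for b in payload])
    print(duration_seconds(80000))
```

Note that a duration computed this way covers the whole segment, including non-speech portions, so it can exceed the nominal speech durations listed in the summary.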
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Vietnamese, Thai, Tamil, Spanish, Russian, Korean, Japanese,
Hindi, Persian, English, German, Mandarin Chinese, Bengali, Standard Arabic, Dari,
Iranian Persian, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635308
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 498-359-265-464-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
wuu
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
nan
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2007 NIST Language Recognition Evaluation Supplemental Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2007 NIST Language Recognition Evaluation Supplemental Training Set consists of 118
hours of conversational telephone speech segments in the following languages and dialects:
Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin,
Cantonese, Russian, Mexican Spanish, Thai, Urdu and Tamil. The goal of the NIST (National
Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to
establish the baseline of current performance capability for language recognition
of conversational telephone speech and to lay the groundwork for further research
efforts in the field. NIST conducted three previous language recognition evaluations,
in 1996, 2003 and 2005. The most significant differences between those evaluations
and the 2007 task were the increased number of languages and dialects, the greater
emphasis on a basic detection task for evaluation and the variety of evaluation conditions.
Thus, in 2007, given a segment of speech and a language of interest to be detected
(i.e., a target language), the task was to decide whether that target language was
in fact spoken in the given telephone speech segment (yes or no), based on an automated
analysis of the data contained in the segment. The supplemental training material
in this release consists of the following: * Approximately 53 hours of conversational
telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min
Nan Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken from LDC's
CALLHOME, CALLFRIEND and Mixer collections. * Approximately 65 hours of full telephone
conversations in Mandarin Chinese (Taiwan), Spanish (Mexican) and Tamil. This material
was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The
test segments used in the 2005 NIST Language Recognition Evaluation were derived from
these full conversations. In addition to the supplemental material contained in this
release, the training data for the 2007 NIST Language Recognition Evaluation consisted
of data from previous LRE evaluation test sets, namely, 2003 NIST Language Recognition
Evaluation and 2005 NIST Language Recognition Evaluation.
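The yes/no detection decision described in this summary can be sketched as a simple threshold test. This is only an illustration: the detect function, the score values and the 0.5 threshold are hypothetical and are not taken from the NIST evaluation plan or from this release.

```python
# Hypothetical sketch of a language detection decision.
# The scores and the threshold are invented for illustration only.

def detect(scores, target, threshold=0.5):
    """Return True ("yes") if the target language is judged to be spoken.

    scores maps a language name to a hypothetical confidence in [0, 1]
    produced by some automated analysis of one speech segment.
    A language missing from scores is treated as having score 0.0.
    """
    return scores.get(target, 0.0) >= threshold


# Invented scores for a single telephone speech segment.
segment_scores = {"Thai": 0.82, "Urdu": 0.10, "Tamil": 0.05}

for language in ("Thai", "Urdu"):
    answer = "yes" if detect(segment_scores, language) else "no"
    print(f"Is {language} spoken in the segment? {answer}")
```

Each target language is tested independently, which matches the one-language-at-a-time framing of the task; how the scores are produced is left open here.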
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Wu Chinese, Urdu, Thai, Tamil, Spanish, Russian, Min Nan Chinese,
Mandarin Chinese, Bengali, and Egyptian Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le, Audrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
van Santen, Jan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635316
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T29
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 150-170-243-077-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACL Anthology Reference Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T29
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACL Anthology Reference Corpus is a digital archive of 10,921 research papers in computational
linguistics sponsored by the Association for Computational Linguistics (ACL). Also
available from the ACL, this release contains most of the papers that appear up to
February 2007 in the web-based ACL Anthology, a dynamic repository that currently
hosts over 16,500 articles drawn from a range of conferences and workshops as well
as past issues of the Computational Linguistics journal. The ACL Anthology Reference
Corpus is designed to be a standard, real-world digital collection testbed for experiments
in bibliographic and bibliometric research. The ACL is the international scientific
and professional society for scholars working on problems involving natural language
and computation. Membership includes the ACL quarterly journal, Computational Linguistics,
reduced registration at most ACL-sponsored conferences, discounts on ACL-sponsored
publications and participation in ACL Special Interest Groups. Since 1988, Computational
Linguistics has been the primary forum for research on computational linguistics and
natural language processing. *Data* The material in the ACL Anthology Reference Corpus
was scanned at 600dpi grayscale for archival storage, down-sampled to 300dpi black-and-white,
assembled into articles and stored in the "PDF Image with Hidden Text" format. Author
and title metadata was extracted from the OCRed text and used to build HTML index
pages. Older materials, such as conference proceedings from the 1960s and early volumes
of Computational Linguistics, were manually digitized from microfiche slides. The ACL
Anthology Reference Corpus includes: * 10,921 PDF files in the pdf/anthology-PDF tree. *
13,551 files with metadata described in the metadata/anthology-XML tree. * 84,542 pages
in the PDF files.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kan, Min-Yen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bird, Steven
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T29
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2009 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635324
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2009T30
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 766-411-032-967-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Gigaword Fourth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2009]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2009T30
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T30
and ISBN 1-58563-532-4, is a comprehensive archive of Arabic newswire text that has
been acquired over several years at LDC. Arabic Gigaword Fourth Edition includes all
of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly-collected
data. In addition, three new sources have been added in the fourth edition: Al-Ahram,
Asharq Al-Awsat and Al-Quds Al-Arabi. Nine distinct international sources of Arabic
newswire are represented here: * Al-Ahram (ahr_arb) * Asharq Al-Awsat (aaw_arb) *
Agence France Presse (afp_arb) * Assabah (asb_arb) * Al Hayat (hyt_arb) * An Nahar
(nhr_arb) * Al-Quds Al-Arabi (qds_arb) * Ummah Press (umh_arb) * Xinhua News Agency
(xin_arb) The seven-character codes shown above represent both the directory names
where the data files are found and the 7-letter prefix that appears at the beginning
of every file name. The 7-letter codes consist of the three-character source name
IDs and the three-character language code ("arb") separated by an underscore ("_")
character. These news services all use Modern Standard Arabic (MSA), so there should
be a fairly limited scope for orthographic and lexical variation due to regional Arabic
dialects. However, to the extent that regional dialects might have an influence on
MSA usage, the following should be noted: * Al-Ahram is based in Cairo, Egypt. * Asharq
Al-Awsat is based in London, England, UK. * An Nahar is based in Beirut, Lebanon.
* Al Hayat was originally a Lebanese news service, but it has been based in London
during the entire period represented in this archive. * Assabah is based in Tunisia.
* The Xinhua and Agence France Presse (AFP) services are obviously international in
scope (Xinhua is based in Beijing, AFP in Paris), and the regional distribution of
Arabic reporters and editors for these services is not known. * The content provided
by Ummah Press comes from diverse sources throughout the Arabic-speaking world. *
Al-Quds Al-Arabi is based in London, England, UK. *New in the Fourth Edition* * New
Sources This release marks the first edition of Arabic Gigaword to include content
from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi covering the period from November
2006 through December 2008. * New Data for Existing Sources This release contains
all data collected by LDC from January 2007 through December 2008, except for Ummah
Press for which data from January 2005 through December 2008 is included. The table
below shows data quantity by source under the following categories: data source (Source);
the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed
file size (Totl-MB); the number of space-separated word tokens in the text (K-wrds);
and the number of documents per source (#DOCs).
Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
aaw_arb      26      114      386   36694    87506
afp_arb     176      530     1979  184631   930656
ahr_arb      26      114      131   42265   107187
asb_arb      52       45      149   14322    32794
hyt_arb     166      663     2224  209318   448335
nhr_arb     157      784     2662  253559   557151
qds_arb      26       62      198   18996    49352
umh_arb      68      9.3       31    2995    11350
xin_arb      91      245      890   85689   492664
Totals      788     5018     8650  848469  2716995
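The seven-character naming convention described in this summary can be sketched as a small parser. The source IDs and names are the ones listed in the summary; the parse_prefix function and the sample file name are hypothetical, and the part of a file name that follows the prefix is not specified in this record.

```python
import re

# Pattern taken from the description: a three-character source ID,
# an underscore, then the three-character language code "arb".
PREFIX_RE = re.compile(r"^(?P<source>[a-z]{3})_(?P<lang>arb)")

# Source IDs and names as listed in the summary.
SOURCES = {
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "ahr": "Al-Ahram",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}


def parse_prefix(name):
    """Return (source_id, source_name, language_code), or None if the
    name does not start with one of the nine known prefixes."""
    match = PREFIX_RE.match(name)
    if match is None or match.group("source") not in SOURCES:
        return None
    source = match.group("source")
    return source, SOURCES[source], match.group("lang")


# "afp_arb_200701.gz" is an invented file name used only as an example.
print(parse_prefix("afp_arb_200701.gz"))
print(parse_prefix("readme.txt"))
```

Because the same code names both the directory and the file prefix, the function can be applied to either.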
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2009T30
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u urd d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635332
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 000-078-785-720-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This file contains documentation for NIST Open Machine Translation 2008 Evaluation
(MT08) Selected Reference and System Translations, Linguistic Data Consortium (LDC)
catalog number LDC2010T01 and ISBN 1-58563-533-2. NIST Open MT is an evaluation series
to support research in, and help advance the state of the art of, technologies that
translate text between human languages. Participants submit machine translation output
of source language data to NIST (National Institute of Standards and Technology);
the output is then evaluated with automatic and manual measures of quality against
high quality human translations of the same source data. This program supports the
growing interest in system combination approaches that generate improved translations
from output of several different machine translation (MT) systems. MT system combination
approaches require data sets composed of high-quality human reference translations
and a variety of machine translations of the same text. The NIST Open Machine Translation
2008 Evaluation (MT08) Selected Reference and System Translations set addresses this
need. The data in this release consists of the human reference translations and corresponding
machine translations for the NIST Open MT08 test sets, which consist of newswire and
web data in the four MT08 language pairs -- Arabic-to-English, Chinese-to-English,
English-to-Chinese (newswire only) and Urdu-to-English. Two documents per language
pair and genre were removed at random from the test sets for release. For the machine
translations, only output from one submission (in most cases, the participant's primary
submission) per training condition (Constrained and Unconstrained training, where
available) per participant is included. See section 2 of the MT08 Evaluation Plan
for a description of the training conditions. The resulting data set has the following
characteristics: * Arabic-to-English: 120 documents with 1312 segments, output from
17 machine translation systems. * Chinese-to-English: 105 documents with 1312 segments,
output from 23 machine translation systems. * English-to-Chinese: 127 documents with
1830 segments, output from 11 machine translation systems. * Urdu-to-English: 128
documents with 1794 segments, output from 12 machine translation systems. The data
is organized and annotated in such a way that subsets for each language pair and/or
data genre and/or training condition can be extracted and used separately, depending
on the user's needs.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Urdu, Mandarin Chinese, Standard Arabic, English, Chinese, and Arabic.
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635340
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 539-629-573-162-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Czech Broadcast News MDE Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Czech Broadcast News MDE Transcripts, Linguistic Data Consortium (LDC) catalog number
LDC2010T02 and ISBN 1-58563-534-0, was prepared by researchers at the University of
West Bohemia, Pilsen, Czech Republic. It consists of metadata extraction (MDE) annotations
for the approximately 26 hours of transcribed broadcast news speech in Czech Broadcast
News Transcripts (LDC2004T01). The audio files corresponding to the transcripts in
this corpus are contained in Czech Broadcast News Speech (LDC2004S01). Czech Broadcast
News MDE Transcripts joins LDC's other holdings of Czech broadcast data: Czech Broadcast
Conversation Speech (LDC2009S02), Czech Broadcast Conversation MDE Transcripts (LDC2009T20),
Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and Voice of America
(VOA) Czech Broadcast News Transcripts (LDC2000T53). The audio recordings were collected
from February 1, 2000 through April 22, 2000 from three Czech radio stations (Cesky
rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2 and Cesky rozhlas 3 Vltava
- CRo3) and two television stations (Ceska televize - CTV and Prima TV). The broadcasts
included both public and commercial subjects and were presented in various styles,
ranging from a formal style to a colloquial style more typical for commercial broadcast
companies that do not primarily focus on news. The goal of MDE research is to take
raw speech recognition output and refine it into forms that are of more use to humans
and to downstream automatic processes. In simple terms, this means the creation of
automatic transcripts that are maximally readable. This readability might be achieved
in a number of ways: removing non-content words like filled pauses and discourse markers
from the text; removing sections of disfluent speech; and creating boundaries between
natural breakpoints in the flow of speech so that each sentence or other meaningful
unit of speech might be presented on a separate line within the resulting transcript.
Natural capitalization, punctuation, standardized spelling and sensible conventions
for representing speaker turns and identity are further elements in the readable transcript.
The transcripts and annotations in this corpus are stored in two formats: QAn (Quick
Annotator), and RTTM. Character encoding in all files is ISO-8859-2. More information
can be found on the website Structural Metadata Annotation for Czech. *Sponsorship*
The completion of this corpus was facilitated by funding provided by the Ministry
of Education of the Czech Republic under projects No. 2C06020 and ME909.
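Because the summary states that all files use ISO-8859-2, the encoding must be named explicitly when the data is read; otherwise Czech diacritics are likely to be garbled. A minimal sketch follows. The decode_latin2 function and the sample string are illustrative assumptions, and no real file name from the corpus is used.

```python
def decode_latin2(raw: bytes) -> str:
    """Decode bytes that are known to be ISO-8859-2 (Latin-2) text."""
    return raw.decode("iso-8859-2")


# Invented sample: a Czech phrase encoded as ISO-8859-2 bytes.
sample = "Český rozhlas".encode("iso-8859-2")

print(decode_latin2(sample))             # decoded with the stated encoding
print(sample.decode("utf-8", "replace")) # wrong codec: diacritics are lost
```

When reading from disk, the same effect is obtained by passing encoding="iso-8859-2" to Python's built-in open().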
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kolar, Jachym
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635383
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 675-764-258-846-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NPS Internet Chatroom Conversations, Release 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NPS Internet Chatroom Conversations, Release 1.0 consists of 10,567 English posts
(45,068 tokens) gathered from age-specific chat rooms of various online chat services
in October and November 2006. Each file is a text recording from one of these chat
rooms for a short period on a particular day. Users should be aware that some of the
conversations in this corpus feature subjects and language that some people may find
offensive or objectionable, including discussions of a sexual nature. This corpus
was developed by researchers at the Department of Computer Science, Naval Postgraduate
School, Monterey, California. Although much work has been accomplished in Natural
Language Processing (NLP) in traditional written and spoken language domains, relatively
little has been performed in the newer computer-mediated communication (CMC) domains
enabled by the Internet, such as text-based chat. One factor inhibiting research in
this area has been the dearth of annotated CMC corpora available to the broader research
community, despite the increasing use of CMC in a variety of areas and applications.
NPS Internet Chatroom Conversations is one of the first text-based chat corpora tagged
with lexical and discourse information. This corpus might be used to develop stochastic
NLP applications that perform tasks such as conversation thread topic detection, author
profiling, entity identification, and social network analysis. Each post is annotated
with a chat dialog-act tag, and individual tokens within each post are annotated with
part-of-speech tags. 3,507 tokenized posts were automatically tagged using a part-of-speech
tagger trained on the Penn Treebank corpora, combined with a regular expression that
identified privacy-masked user names and emoticons. Similarly, simple regular expression
matching was employed to assign an initial chat dialog-act to each of this subset
of posts. This initial tagging was verified by hand (with corrections made where necessary).
The remaining 7,060 posts were POS-tagged using a POS tagger that was trained on the
newly hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging on the
remaining posts was accomplished using a back-propagation neural network trained on
21 features of the initial dialog-act-labeled posts. The tagging of this second group
of posts was also manually verified (and corrected where necessary). Ultimately, all
of the 10,567 privacy-masked posts, representing 45,068 tokens, were annotated with
manually verified part-of-speech and dialog act information. Filenames consist of
date, target age group, and number of posts. For example, the file 10-19-20s_706posts.xml
contains 706 posts gathered from the 20s chat room on October 19, 2006. The posts
have been privacy-masked in two ways. First, all usernames have been changed to generic
names of the form "UserN", where N is a unique integer consistently used for each
respective poster across all files. The posts were then read by humans to remove other
personally identifiable information. Within each file, usernames are prepended with
the date and chat room portions of the filename. So in the above filename example,
UserN becomes 10-19-20sUserN.
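The file naming and username conventions described in this summary can be sketched as follows. The pattern is inferred from the single example file name given; it assumes that every file uses a two-digit month and day, and the parse_chat_filename function is hypothetical.

```python
import re

# Inferred from the example "10-19-20s_706posts.xml":
# month-day-room, an underscore, the post count, then "posts.xml".
FILENAME_RE = re.compile(
    r"^(?P<month>\d{2})-(?P<day>\d{2})-(?P<room>[^_]+)_(?P<posts>\d+)posts\.xml$"
)


def parse_chat_filename(filename):
    """Split a chat file name into date, room and post count.

    Also returns the prefix that the summary says is prepended to
    usernames inside the file (date and chat room portions).
    """
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    month, day, room = match.group("month", "day", "room")
    return {
        "month": int(month),
        "day": int(day),
        "room": room,
        "posts": int(match.group("posts")),
        "user_prefix": f"{month}-{day}-{room}",
    }


info = parse_chat_filename("10-19-20s_706posts.xml")
print(info)
# Username as it would appear inside the file for a hypothetical User5.
print(info["user_prefix"] + "User5")
```

The year is not part of the file name; the summary gives it separately (October and November 2006).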
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Online chat groups.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Forsyth, Eric
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martell, Craig
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635359
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 782-233-680-153-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 was prepared by LDC and contains
223,000 characters (98 files) of Chinese newsgroup text and its translation selected
from twenty-one sources. Newsgroups consist of posts to electronic bulletin boards,
Usenet newsgroups, discussion groups and similar forums. This release was used as
training data in Phase 1 (year 1) of the DARPA-funded GALE program. *Source Data*
Preparing the source data involved four stages of work: data scouting, data harvesting,
formatting, and data selection. Data scouting involved manually searching the web for
suitable newsgroup text. Data scouts were assigned particular topics and genres along
with a production target in order to focus their web search. Formal annotation guidelines
and a customized annotation toolkit helped data scouts to manage the search process
and to track progress. Data scouts logged their decisions about potential text of
interest (sites, threads and posts) to a database. A nightly process queried the annotation
database and harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data scout. Once
the text was downloaded, its format was standardized (by running various scripts)
so that the data could be more easily integrated into downstream annotation processes.
Original-format versions of each document were also preserved. Typically, a new script
was required for each new domain name that was identified. After scripts were run,
an optional manual process corrected any remaining formatting problems. The selected
documents were then reviewed for content-suitability using a semi-automatic process.
A statistical approach was used to rank a document's relevance to a set of already-selected
documents labeled as "good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular annotation task
or for annotation in general. These newly-judged documents in turn provided additional
input for the generation of new ranked lists. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich Transcription specification.
Three types of end of sentence SU were identified: statement SU, question SU and incomplete
SU. *Translation* After files were selected, they were reformatted into a human-readable
translation format, and the files were then assigned to professional translators for
careful translation. Translators followed GALE Translation guidelines which describe
the makeup of the translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as names and speech
disfluencies), and quality control procedures applied to completed translations. *TDF
Format* All final data are in Tab Delimited Format (TDF). TDF is compatible with other
transcription formats, such as the Transcriber format and AG format, and it is easy
to process. Each line of a TDF file corresponds to a speech segment and contains 13
tab-delimited fields:
    field           data_type
1   file            unicode
2   channel         int
3   start           float
4   end             float
5   speaker         unicode
6   speakerType     unicode
7   speakerDialect  unicode
8   transcript      unicode
9   section         int
10  turn            int
11  segment         int
12  sectionType     unicode
13  suType          unicode
A source TDF file and its translation are the same except that the transcript
in the source TDF is replaced by its English translation. Some fields are inapplicable
to newsgroup text. Those include the channel, start time, end time and speaker dialect
fields. These fields are either empty or contain values as a placeholder. *Encoding*
All data are encoded in UTF-8. *Sponsorship* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
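The 13-field TDF layout described in the summary above can be read with a few lines of Python. This is a minimal sketch, not LDC tooling: the function name `parse_tdf_line` is hypothetical, and the handling of empty or placeholder values (empty becomes `None`, unconvertible values stay as strings) is an assumption based on the note that some fields are inapplicable to newsgroup text.

```python
# Field names and types as listed in the summary; everything else is illustrative.
TDF_FIELDS = [
    ("file", str), ("channel", int), ("start", float), ("end", float),
    ("speaker", str), ("speakerType", str), ("speakerDialect", str),
    ("transcript", str), ("section", int), ("turn", int), ("segment", int),
    ("sectionType", str), ("suType", str),
]


def parse_tdf_line(line):
    """Split one tab-delimited TDF line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(values)}")
    record = {}
    for (name, cast), raw in zip(TDF_FIELDS, values):
        if raw == "":
            record[name] = None          # inapplicable field left empty
            continue
        try:
            record[name] = cast(raw)
        except ValueError:
            record[name] = raw           # placeholder that is not a number
    return record
```

Because a source file and its translation differ only in the transcript field, the same function should apply to both sides of a parallel pair.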
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Translating into English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635375
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 190-555-041-041-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher Spanish - Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher Spanish - Transcripts was developed by LDC and contains full orthographic transcripts
of the telephone speech in Fisher Spanish - Speech (LDC2010S01). Transcripts cover
roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean
Spanish speakers. The Fisher telephone conversation collection protocol was created
at LDC to address a critical need of developers trying to build robust automatic speech
recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II
and the resulting corpora, have been adapted for ASR research but were in fact developed
for language and speaker identification respectively. Although the CALLHOME protocol
and corpora were developed to support ASR technology, they feature small numbers of
speakers making telephone calls of relatively long duration with narrow vocabulary
across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a very large number of participants each make a few calls
of short duration speaking to other participants, whom they typically do not know,
about assigned topics. This maximizes inter-speaker variation and vocabulary breadth
although it also increases formality. Previous protocols such as CALLHOME, CALLFRIEND
and Switchboard relied upon participant activity to drive the collection. Fisher is
unique in being platform driven rather than participant driven. Participants who wish
to initiate a call may do so; however the collection platform initiates the majority
of calls. Participants need only answer their phones at the times they specified when
registering for the study. To encourage a broad range of vocabulary, Fisher participants
are asked to speak on an assigned topic which is selected at random from a list, which
changes every 24 hours and which is assigned to all subjects paired on that day. Some
topics are inherited or refined from previous Switchboard studies while others were
developed specifically for the Fisher protocol. In collecting data for this corpus,
attempts were made to provide a representative distribution of subjects across a variety
of demographic categories including: gender, age, dialect region, and education level.
This corpus joins other Fisher corpora: Arabic CTS Levantine Fisher Training Data
Set 3 (LDC2005S07, LDC2005T03), Fisher English Training Part 2 (LDC2005S13, LDC2005T19),
Fisher English Training Speech Part 1 (LDC2004S13, LDC2004T19), and Fisher Levantine
Arabic Conversational Telephone Speech (LDC2007S02, LDC2007T04). *Data* The transcript
files are in plain-text, tab-delimited format (tdf) with UTF-8 character encoding.
They were created with the LDC-developed transcription tool "XTrans", which allowed
for improved handling of multi-channel audio and overlapping speakers. XTrans is available
from LDC. Transcribers followed LDC's Transcription Guidelines (NQTR), which are included
with the documentation for this release. The first line of each transcript file provides
the column headings; the next two lines are "comments" that can be ignored (these
are used by XTrans; they are distinguished from non-comment lines by having an initial
semicolon ";"). Actual transcript data, with time stamps, channel number, transcript
text and additional information, begins at line 4 of each transcript file. Fisher
Spanish - Speech (LDC2010S01) provides the digital audio used as the basis for the
transcriptions in this corpus, in the form of 2-channel mu-law sample data with 8000
samples per second (as captured from the public telephone network), for 819 telephone
conversations of 10 to 12 minutes in duration. The audio files are in NIST SPHERE
format (1024-byte ASCII file headers). Native speakers of Caribbean Spanish and non-Caribbean
Spanish were recruited from within the continental United States and Puerto Rico.
The following tables provide an overview of the demographics of the participants.
The Subjects Table file, provided in the documentation, may be used to answer questions
about specific combinations of participant characteristics (including level of participation).
Participants  Country Raised
47            U.S.A.
20            Argentina
14            Mexico
11            Colombia
7             Chile
6             Puerto Rico
5             Spain
5             Peru
3             Venezuela
3             Canada
3             Panama
3             Guatemala
2             Paraguay
1             Cuba
1             Honduras
1             Uruguay
1             Bolivia
1             Dominican Republic
1             Switzerland
1             Ecuador

Conversation Sides  Participants
1                   6
2                   5
3                   4
4                   3
5                   3
6                   2
7                   2
8                   1
9                   1
10                  13
11                  10
12                  9
13                  8
14                  8
15                  7
16                  7
17                  7
18                  7
19                  7
20                  6
21                  5
22                  5
23                  5
24                  5

Years Education  Participants
2                1
4                2
5                2
6                1
11               1
12               15
13               7
14               16
15               12
16               25
17               10
18               16
19               3
20               9
21               2
22               6
23               4
24               1
25               2
28               1

Participants  Dialect
91            Non-Caribbean
45            Caribbean

Participants  Age Group
23            Young
106           Middle
7             Old

Participants  Sex
84            Female
52            Male
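The transcript layout described in the summary above (a column-heading line, two semicolon-prefixed comment lines, then data from line 4) could be read roughly as follows. This is a sketch under stated assumptions: `read_transcript` is a hypothetical helper, and the column names it returns are whatever the heading line of the file contains, since this record does not list them.

```python
def read_transcript(text):
    """Parse tab-delimited transcript text into a list of dicts.

    The first line supplies the column headings; any line that starts
    with ';' is treated as an XTrans comment and skipped.
    """
    lines = text.splitlines()
    if not lines:
        return []
    header = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        if line.startswith(";") or not line.strip():
            continue  # comment or blank line
        rows.append(dict(zip(header, line.split("\t"))))
    return rows
```

Values are left as strings; converting time stamps or channel numbers is deferred to the caller, who can check the actual headings in the release documentation.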
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cartagena, Ingrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635367
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 684-070-443-287-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher Spanish Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher Spanish - Speech was developed by the Linguistic Data Consortium (LDC) and
consists of audio files covering roughly 163 hours of telephone speech from 136 native
Caribbean Spanish and non-Caribbean Spanish speakers. Full orthographic transcripts
of these audio files are available in Fisher Spanish - Transcripts (LDC2010T04). The
Fisher telephone conversation collection protocol was created at LDC to address a
critical need of developers trying to build robust automatic speech recognition (ASR)
systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II and
the resulting corpora, have been adapted for ASR research but were in fact developed
for language and speaker identification respectively. Although the CALLHOME protocol
and corpora were developed to support ASR technology, they feature small numbers of
speakers making telephone calls of relatively long duration with narrow vocabulary
across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a very large number of participants each make a few calls
of short duration speaking to other participants, whom they typically do not know,
about assigned topics. This maximizes inter-speaker variation and vocabulary breadth
while also increasing formality. Previous protocols such as CALLHOME, CALLFRIEND and
Switchboard relied upon participant activity to drive the collection. Fisher is unique
in being platform driven rather than participant driven. Participants who wish to
initiate a call may do so; however, the collection platform initiates the majority of
calls. Participants need only answer their phones at the times they specified when
registering for the study. To encourage a broad range of vocabulary, Fisher participants
are asked to speak on an assigned topic which is selected at random from a list, which
changes every 24 hours and which is assigned to all subjects paired on that day. Some
topics are inherited or refined from previous Switchboard studies while others were
developed specifically for the Fisher protocol. In collecting data for this corpus,
attempts were made to provide a representative distribution of subjects across a variety
of demographic categories including: gender, age, dialect region, and education level.
This corpus joins other Fisher corpora: Arabic CTS Levantine Fisher Training Data
Set 3 (LDC2005S07, LDC2005T03), Fisher English Training Part 2 (LDC2005S13, LDC2005T19),
Fisher English Training Speech Part 1 (LDC2004S13, LDC2004T19), and Fisher Levantine
Arabic Conversational Telephone Speech (LDC2007S02, LDC2007T04). *Data* The speech
recordings consist of 819 telephone conversations of 10 to 12 minutes in duration.
They are provided as digital audio files in NIST SPHERE format (1024-byte ASCII file
headers). The conversations were recorded as 2-channel mu-law sample data with 8000
samples per second (as captured from the public telephone network). The accompanying
transcript files (available in Fisher Spanish - Transcripts (LDC2010T04)) are in plain-text,
tab-delimited format (tdf) with UTF-8 character encoding. They were created with the
LDC-developed transcription tool XTrans, which allowed for improved handling of multi-channel
audio and overlapping speakers. XTrans is available from LDC. Transcribers followed
LDC's Transcription Guidelines (NQTR), which are included with the documentation for
this release. The first line of each transcript file provides the column headings;
the next two lines are "comments" that can be ignored (these are used by XTrans; they
are distinguished from non-comment lines by having an initial semicolon ";"). Actual
transcript data, with time stamps, channel number, transcript text and additional
information, begins at line 4 of each transcript file. Native speakers of Caribbean
Spanish and non-Caribbean Spanish were recruited from within the continental United
States and Puerto Rico. The following tables provide an overview of the demographics
of the participants. The Subjects Table file, provided in the documentation, may be
used to answer questions about specific combinations of participant characteristics
(including level of participation).

Participants  Country Raised
47            U.S.A.
20            Argentina
14            Mexico
11            Colombia
7             Chile
6             Puerto Rico
5             Spain
5             Peru
3             Venezuela
3             Canada
3             Panama
3             Guatemala
2             Paraguay
1             Cuba
1             Honduras
1             Uruguay
1             Bolivia
1             Dominican Republic
1             Switzerland
1             Ecuador

Conversation Sides  Participants
1                   6
2                   5
3                   4
4                   3
5                   3
6                   2
7                   2
8                   1
9                   1
10                  13
11                  10
12                  9
13                  8
14                  8
15                  7
16                  7
17                  7
18                  7
19                  7
20                  6
21                  5
22                  5
23                  5
24                  5

Years Education  Participants
2                1
4                2
5                2
6                1
11               1
12               15
13               7
14               16
15               12
16               25
17               10
18               16
19               3
20               9
21               2
22               6
23               4
24               1
25               2
28               1

Participants  Dialect
91            Non-Caribbean
45            Caribbean

Participants  Age Group
23            Young
106           Middle
7             Old

Participants  Sex
84            Female
52            Male
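The audio layout described in the summary above (NIST SPHERE files with a 1024-byte ASCII header, followed by 2-channel mu-law samples at 8000 samples per second) can be illustrated with a small pure-Python reader. This is a sketch, not LDC software: the function names are hypothetical, the sample data is assumed to be uncompressed and interleaved by channel, and the header field names `channel_count`, `sample_rate` and `sample_coding` follow the usual SPHERE convention rather than anything stated in this record.

```python
def parse_sphere_header(data):
    """Return (header_size, fields) parsed from the start of a SPHERE file."""
    first, second, _ = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(second.strip())  # 1024 for the files described above
    fields = {}
    text = data[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue
        name, kind, value = line.split(None, 2)
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN" string fields
            fields[name] = value
    return header_size, fields


def mulaw_to_pcm(byte):
    """Decode one 8-bit G.711 mu-law value to a signed linear sample."""
    u = ~byte & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84
    return -magnitude if u & 0x80 else magnitude


def split_channels(data):
    """Decode interleaved mu-law bytes into one sample list per channel."""
    header_size, fields = parse_sphere_header(data)
    body = data[header_size:]
    n = fields.get("channel_count", 2)
    return [[mulaw_to_pcm(b) for b in body[ch::n]] for ch in range(n)]
```

Calling `split_channels(open(path, "rb").read())` would then give one list of samples per side of the conversation, provided the file matches the assumptions above.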
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Shudong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cartagena, Ingrid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635464
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 951-452-048-245-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2005 Mandarin SpatialML Annotations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE 2005 Mandarin SpatialML Annotations was developed by researchers at The MITRE
Corporation (MITRE). ACE 2005 Mandarin SpatialML Annotations applies SpatialML tags
to a subset of the source Mandarin training data in ACE 2005 Multilingual Training
Corpus (LDC2006T06). Annotations for entities, relations, and events, which were included
in ACE 2005 Multilingual Training Corpus, are not included in the current SpatialML
release. For SpatialML markup to ACE 2005 English data, see ACE 2005 English SpatialML
Annotations (LDC2008T03). SpatialML is a mark-up language for representing spatial
expressions in natural language documents. SpatialML focuses on geography and culturally-relevant
landmarks, rather than biology, cosmology, geology, or other regions of the spatial
language domain. The goal is to allow for better integration of text collections with
resources such as databases that provide spatial information about a domain, including
gazetteers, physical feature databases and mapping services. The ACE (Automatic Content
Extraction) Program seeks to develop extraction technology to support automatic processing
of source language data (in the form of natural text, and as text derived from automatic
speech recognition and optical character recognition). This includes classification,
filtering, and selection based on the language content of the source data, i.e., based
on the meaning conveyed by the data. Thus the ACE program requires the development
of technologies that automatically detect and characterize this meaning. The annotation
efforts of the ACE program support the development of automatic content extraction
technology to support automatic processing of human language in text form. The kind
of information recognized and extracted from text includes entities, values, temporal
expressions, relations and events. The SpatialML annotation scheme is intended to emulate
earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines.
The main SpatialML tag is the PLACE tag which encodes information about location.
The central goal of SpatialML is to map location information in text to data from
gazetteers and other databases to the extent possible by defining attributes in the
PLACE tag. Therefore, semantic attributes such as country abbreviations, country subdivision
and dependent area abbreviations (e.g., US states), and geo-coordinates are used to
help establish such a mapping. LINK and PATH tags express relations between places,
such as inclusion relations and trajectories of various kinds. Information in the
tag along with the tagged location string should be sufficient to uniquely determine
the mapping, when such a mapping is possible. This also means that redundant information
is not included in the tag. To the extent possible, SpatialML leverages ISO and other
standards towards the goal of making the scheme compatible with existing and future
corpora. The SpatialML guidelines are compatible with existing guidelines for spatial
annotation and existing corpora within the ACE research program. *Data* This corpus
consists of a 298-document subset of broadcast material from the ACE 2005 Multilingual
Training Corpus (LDC2006T06) that has been tagged by a native Mandarin speaker according
to version 2.3 of the SpatialML annotation guidelines, which are included in the documentation
for this release.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wang, Xiaoman
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hitzeman, Janet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mani, Inderjeet
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635391
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 958-238-545-740-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Web 5-gram Version 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06
and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese
word n-grams and their observed frequency counts generated from over 800 billion tokens
of text. The length of the n-grams ranges from unigrams (single words) to 5-grams.
This data should be useful for statistical language modeling (e.g., segmentation,
machine translation) as well as for other uses. Included with this publication is
a simple segmenter written in Perl using the same algorithm used to generate the data.
*Data Collection* N-gram counts were generated from approximately 883 billion word
tokens of text from publicly accessible web pages. This data set contains only n-grams
that appeared at least 40 times in the processed sentences. Less frequent n-grams
were discarded. While the aim was to identify and collect only Chinese language pages,
some text from other languages is incidentally included in the final data. Data collection
took place in March 2008; no text that was created on or after April 1, 2008 was used
to develop this corpus. *Preprocessing* The input character encoding of documents
was automatically detected, and all text was converted to UTF-8. The data was tokenized
by an automatic tool, and all continuous Chinese character sequences were processed
by the segmenter. The following types of tokens are considered valid:
* A Chinese word containing only Chinese characters.
* Numbers, e.g., 198, 2,200, 2.3, etc.
* Single Latin tokens, such as Google, &ab, etc.
*Extent of Data*
* File sizes: approx. 30 GB compressed (gzip'ed) text files
* Number of tokens: 882,996,532,572
* Number of sentences: 102,048,435,515
* Number of unigrams: 1,616,150
* Number of bigrams: 281,107,315
* Number of trigrams: 1,024,642,142
* Number of fourgrams: 1,348,990,533
* Number of fivegrams: 1,256,043,325
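The counting scheme described in the summary above (n-grams of length 1 to 5, keeping only those seen at least 40 times) can be sketched in a few lines. This is an illustration, not the procedure actually used to build the corpus: `count_ngrams` is a hypothetical name, the input is assumed to be already segmented into word tokens, and n-grams are assumed not to cross sentence boundaries, which this record does not specify.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5, min_count=40):
    """Count word n-grams (n = 1..max_n) and drop the infrequent ones.

    `sentences` is an iterable of token lists, one list per sentence.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only n-grams that meet the frequency threshold.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}
```

At web scale the counts would not fit in one in-memory `Counter`; the sketch only shows the thresholding logic on a small input.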
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Word frequency
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Linguistics
- Form subdivision:
Databases.
- General subdivision:
Statistical methods
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yang, Meng
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Dekang
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635405
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 134-879-445-817-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
WTIMIT 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous
Speech Corpus (TIMIT, LDC93S1). TIMIT contains wideband speech recordings (i.e., sampled
at 16 kHz) of 630 speakers of American English from eight major dialect regions,
each reading ten phonetically rich sentences. The TIMIT speech corpus was completed
in 1993 and was intended for acoustic-phonetic studies as well as for the development and
evaluation of automatic speech recognition (ASR) systems. In the meantime, five TIMIT
derivatives have been developed: FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT.
The FFMTIMIT (LDC96S32) corpus (Free-Field Microphone TIMIT) consists of the original
TIMIT database recorded with a free-field microphone. NTIMIT (LDC93S2) (Network
TIMIT) serves as a telephone bandwidth adjunct to TIMIT, containing its speech files
transmitted over a telephone handset and the NYNEX telephone network, subject to a
large variety of channel conditions. For the cellular bandwidth speech corpus CTIMIT
(LDC96S30), the original TIMIT recordings were passed through cellular telephone circuits.
The HTIMIT (LDC98S67) corpus (Handset TIMIT) offers a TIMIT subset of 192 male and
192 female speakers through different telephone handsets for the study of telephone
transducer effects on speech. For the single-channel telephone corpus STC-TIMIT (LDC2008S03),
the TIMIT recordings were sent through a real and, in contrast to NTIMIT, single telephone
channel. While some of these derivative TIMIT corpora consist of wideband speech,
others are telephony corpora representing narrowband speech, i.e., sampled at 8 kHz
and containing frequency components from about 300 Hz to 3.4 kHz. Until now, no real-world
wideband telephony speech corpus has been publicly available. Thanks to wideband
speech codecs such as G.722, G.722.1, G.722.2 (i.e., Adaptive Multi-Rate Wideband,
AMR-WB), and G.711.1, wideband telephony speech transmission is already feasible,
even in an increasing number of mobile networks. Hence, a wideband telephone bandwidth
adjunct to TIMIT is desirable for a wide range of scientific investigations, as well
as development and evaluation of systems, e.g., Interactive Voice Response (IVR) systems.
WTIMIT 1.0 (Wideband Mobile TIMIT) contains the recordings of the original TIMIT speech
files after transmission over a real 3G AMR-WB mobile network. WTIMIT 1.0 is organized
according to the original TIMIT corpus. The training subset consists of 4620 speech
files, while the test subset contains 1680 speech files. The speech format of the
WTIMIT corpus is raw (i.e., no header information) and specified as follows: * 16
kHz sampling rate * 16 bit, 1-channel linear PCM sampling format * little-endian byte
order * signed *Data* Data preparation was conducted by converting the original TIMIT
speech files into raw data (i.e., dropping the first 1024 bytes of header information)
and concatenating them to 11 signal chunks of at most 30 minutes duration. In order
to allow precise de-concatenation after transmission, and in order to be able to examine
codec influence and channel distortion, each signal chunk is preceded by a 4 s calibration
tone. It comprises 2 s of a 1 kHz sine wave followed by another 2 s of a linear sweep
from 0 to 8 kHz. Once stored on a laptop PC, the prepared speech chunks were
ready for transmission over T-Mobile's AMR-WB-capable 3G mobile network in The
Hague, The Netherlands. At the sending end, the speech chunks were played back by
a laptop PC. Via an IEEE 1394 link (FireWire), the data was transmitted digitally
to an external DAC (digital-to-analog converter) of type RME Fireface 400. The analog
signal was then fed electrically into the microphone input of the transmitting Nokia
6220 mobile phone. For this purpose, an audio quality test cable for Nokia mobile
phones was used. Prior to the actual transmission, the output attenuation of the DAC
was adjusted so as to prevent analog saturation at the input circuit of the phone
while ensuring optimal dynamic range. Furthermore, a call to the phone at the receiving
end, a second mobile phone of type Nokia 6220, was established for each speech chunk
separately. Using the field test monitoring software of the phones, we confirmed that
they were situated in different network cells at all times during transmission; moreover,
we verified that the proper speech codec, AMR-WB at a constant data
rate of 12.65 kbit/s, was being employed. Note that this bitrate is by far the most
widely used one. Furthermore, the internal microphone equalization of the transmitting
mobile phone was switched off. At the receiving end, the analog headphone output of
the receiving mobile phone was connected electrically to an ADC (analog-to-digital
converter) of type RME Fireface 400. The analog input gain of the latter device was
adjusted once initially to exploit the dynamic range of the ADC. Sampling was performed
at a rate of 48 kHz, the native sampling rate of the ADC, and with 16 bit precision.
The digital speech signals were transferred to a laptop PC again via an IEEE 1394
link and recorded onto a hard drive. The transmitted speech chunks were decimated
from 48 kHz to 16 kHz sampling rate using a high-quality lowpass filter. Finally,
they were de-concatenated by maximizing the cross-correlation between them and the
original speech files. We followed the de-concatenation methodology of STC-TIMIT,
as described in STC-TIMIT: Generation of a Single-channel Telephone Corpus, in order
to assure a precise sample alignment to the TIMIT speech files. Hence, utterances
in WTIMIT 1.0 can be considered to be time-aligned with an average precision of 0.0625
ms (one sample) with those of TIMIT. In principle, TIMIT's original label files (*.TXT,
*.WRD, *.PHN) are valid for WTIMIT as well. However, the channel was found to frequently
introduce misalignments of about 10 to 20 ms, mainly during speech pauses.
Parts of the affected speech files are therefore slightly misaligned against the original
label information. These channel effects may be related to the packet switching domain
in the UMTS Core Network. Depending on the traffic load in the network, packets are
buffered and queued, which results in a variable packet delay (jitter). If you have
any problems, questions or suggestions concerning WTIMIT, please send a brief email
to Tim Fingscheidt (Technische Universität Braunschweig, Braunschweig, Germany): fingscheidt@ifn.ing.tu-bs.de.
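The raw format and the calibration tone described above can be sketched in a few lines. This is an illustration only, not the authors' tooling: the function names, the 0.5 amplitude, and the chirp phase formula (instantaneous frequency rising linearly from 0 to 8 kHz over 2 s) are assumptions; only the sampling rate, sample format, byte order, and tone layout come from the description.

```python
import array
import math
import sys

SAMPLE_RATE = 16_000  # Hz, as stated for WTIMIT


def calibration_tone(rate=SAMPLE_RATE, amplitude=0.5):
    """2 s of a 1 kHz sine followed by a 2 s linear sweep from 0 to 8 kHz."""
    n = 2 * rate
    sine = [amplitude * math.sin(2 * math.pi * 1000 * i / rate) for i in range(n)]
    # Linear chirp: f(t) = k * t with k = 8000 Hz / 2 s, so phase(t) = pi * k * t^2.
    k = 8000 / 2.0
    sweep = [amplitude * math.sin(math.pi * k * (i / rate) ** 2) for i in range(n)]
    return sine + sweep


def to_raw_pcm(samples):
    """Encode floats in [-1, 1] as headerless 16-bit signed little-endian PCM."""
    pcm = array.array("h", (int(round(s * 32767)) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # force little-endian byte order on big-endian hosts
    return pcm.tobytes()


def from_raw_pcm(data):
    """Decode headerless 16-bit signed little-endian PCM into integer samples."""
    pcm = array.array("h")
    pcm.frombytes(data)
    if sys.byteorder == "big":
        pcm.byteswap()
    return pcm


if __name__ == "__main__":
    raw = to_raw_pcm(calibration_tone())
    print(len(raw), "bytes,", len(from_raw_pcm(raw)) / SAMPLE_RATE, "seconds")
    print("one sample =", 1000 / SAMPLE_RATE, "ms")
```

At 16 kHz one sample lasts 1000 / 16000 = 0.0625 ms, which is the alignment precision quoted in the description.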
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bauer, Patrick
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fingscheidt, Tim
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2002 pau u d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635413
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2002T43
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 433-190-013-581-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2000 HUB5 English Evaluation Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2002]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2002T43
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2000 HUB5 English Evaluation Transcripts was developed by the Linguistic Data Consortium
(LDC) and consists of transcripts of 40 English telephone conversations used in the
2000 HUB5 evaluation sponsored by NIST (National Institute of Standards and Technology).
The Hub5 evaluation series focused on conversational speech over the telephone with
the particular task of transcribing conversational speech into text. Its goals were
to explore promising new areas in the recognition of conversational speech, to develop
advanced technology incorporating those ideas and to measure the performance of new
technology. Further information about the evaluation can be found on the NIST HUB5
website. *Data* This release contains transcripts in .txt format for the 40 source
speech data files used in the evaluation: (1) 20 unreleased telephone conversations
from the Switchboard studies, in which recruited speakers were connected through a
robot operator to carry on casual conversations about a daily topic announced by the
robot operator at the start of the call; and (2) 20 telephone conversations from CALLHOME
American English Speech, which consists of unscripted telephone conversations between
native English speakers. The corresponding speech files are available in 2000 HUB5
English Evaluation Speech (LDC2002S09).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2002T43
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635421
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 156-627-429-482-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 7.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Treebank 7.0, Linguistic Data Consortium (LDC) catalog number LDC2010T07 and
ISBN 1-58563-542-1, consists of over one million words of annotated and parsed text
from Chinese newswire, magazine news, various broadcast news and broadcast conversation
programs, web newsgroups and weblogs. The Chinese Treebank project began at the University
of Pennsylvania in 1998, continued at the University of Colorado and is now at Brandeis
University. The project's goal is to provide a large, part-of-speech tagged and fully
bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained
100,000 syntactically annotated words from Xinhua News Agency (Xinhua) newswire. It
was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and
consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05),
an updated version containing roughly 400,000 words, in 2004. A year later, the LDC
published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0
(LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 adds
new annotated newswire data, broadcast material and web text to this effort. *Data*
This release consists of 2,448 text files, 51,447 sentences, 1,196,329 words and
1,931,381 hanzi (Chinese characters). The data is provided in UTF-8 encoding and the
annotation has Penn Treebank-style labeled brackets. Details of the annotation standard
can be found in the enclosed segmentation, POS-tagging and bracketing guidelines.
The data is provided in four different formats: raw text, word-segmented, word-segmented
and POS-tagged, and syntactically bracketed. Chinese Treebank 7.0 includes
text from the following genres and sources:
Genre (sources): # words
* Newswire (Agence France Presse, China News Service, Guangming Daily, People's Daily, Xinhua News Agency): 260,164
* News Magazine (Sinorama): 256,305
* Broadcast News (China Broadcasting System, China Central TV, China National Radio, China Television System, New Tang Dynasty TV, Phoenix TV, Voice of America): 287,442
* Broadcast Conversation (Anhui TV, China Central TV, CNN, MSNBC, New Tang Dynasty TV, Phoenix TV): 184,161
* Newsgroups, Weblogs: 208,257
* Total: 1,196,329
*Sponsorship* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-0022. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
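As a rough illustration of how the bracketed format relates to the other three formats mentioned above, the sketch below pulls the (word, tag) leaves out of a Penn Treebank-style bracketed string and rebuilds raw, segmented, and POS-tagged views from them. The example tree is made up for illustration and is not taken from the corpus; the function names and the `word_TAG` separator are assumptions, and the enclosed guidelines remain the authority on the actual file layout.

```python
import re

# A preterminal looks like "(TAG word)": a label and one token, no nested brackets.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(bracketed: str):
    """Return (word, tag) pairs for every preterminal in the bracketed string.

    Note: empty categories such as (-NONE- *pro*) would also be returned;
    this sketch does not filter them.
    """
    return [(word, tag) for tag, word in LEAF.findall(bracketed)]


def to_formats(bracketed: str):
    """Derive raw, word-segmented, and POS-tagged strings from a bracketed tree."""
    pairs = leaves(bracketed)
    words = [w for w, _ in pairs]
    return {
        "raw": "".join(words),              # Chinese text is written without spaces
        "segmented": " ".join(words),
        "pos_tagged": " ".join(f"{w}_{t}" for w, t in pairs),  # assumed separator
    }


if __name__ == "__main__":
    tree = "(IP (NP (NN 经济)) (VP (VV 发展)))"  # hypothetical example
    print(to_formats(tree))
```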
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
China
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhong, Xiuhong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635472
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 015-759-255-644-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2003 NIST Speaker Recognition Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2003 NIST Speaker Recognition Evaluation was developed by researchers at NIST (National
Institute of Standards and Technology). It consists of just over 120 hours of English
conversational telephone speech used as training data and test data in the 2003 Speaker
Recognition Evaluation (SRE), along with evaluation metadata and test set answer keys.
2003 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations
conducted by NIST. These evaluations provide an important contribution to the direction
of research efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of text-independent
speaker recognition. To this end the evaluation was designed to be simple, to focus
on core technology issues, to be fully supported, and to be accessible to those wishing
to participate. This speaker recognition evaluation focused on the task of 1-speaker
and 2-speaker detection, in the context of conversational telephone speech. The evaluation
was designed to foster research progress, with the goals of: * Exploring promising
new ideas in speaker recognition. * Developing advanced technology incorporating these
ideas. * Measuring the performance of this technology. The original evaluation consisted
of three parts: 1-speaker detection "limited data", 2-speaker detection "limited data",
and 1-speaker detection "extended data". This corpus contains training and test data
and supporting metadata (including answer keys) for only the 1-speaker "limited data"
and 2-speaker "limited data" components of the original evaluation. The 1-speaker
"extended data" component of the original evaluation (not included in this corpus)
provided metadata only, to be used in conjunction with data from Switchboard-2 Phase
II (LDC99S79) and Switchboard-2 Phase III Audio (LDC2002S06). The metadata (resources
and answer keys) for the 1-speaker "extended data" component of the original 2003
SRE evaluation are available from the NIST Speech Group website for the 2003 Speaker
Recognition Evaluation. See the original evaluation plan, included with the documentation
for this corpus, for more detailed information. *Data* The data in this corpus is
a 120-hour subset of data first made available to the public as Switchboard Cellular
Part 2 Audio (LDC2004S07), reorganized (as described below) specifically for use in
the 2003 NIST SRE. For details on data collection methodology, see the documentation
for the above corpus. In the 1-speaker "limited data" component, concatenated turns
of a single side of a conversation were presented. In the 2-speaker "limited data"
component, two sides of conversation were summed together, and both the model speaker
and that speaker's conversation partner were represented in the resulting audio file.
For the 1-speaker "limited data" component, 2 minutes of concatenated turns from a
single conversation were used for training, and 15-45 seconds of concatenated turns
from a 1-minute excerpt of conversation were used for testing. For the 2-speaker "limited
data" component, three whole conversations per participant (minus some introductory
comments) were used for training, and 1-minute conversation excerpts were used for
testing. In the two-speaker detection task, the evaluation participant was required
to separate the speech of the two speakers and then decide (correctly) which side
is the model speaker. To make this challenge feasible, the training conversations
were chosen so that all speakers other than the model speaker were represented in
only one conversation. Thus the model speaker, who is represented in all three conversations,
is the only speaker to be represented in more than one conversation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635480
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 907-893-472-321-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2002 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2002 Open Machine Translation (OpenMT) Evaluation is a package containing source
data, reference translations, and scoring software used in the NIST 2002 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected and developed
by LDC. The objective of the NIST OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation (MT) technologies
-- technologies that translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and fluent translation
of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES
(Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006
evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT.
These evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities in MT. The OpenMT evaluations are intended
to be of interest to all researchers working on the general problem of automatic translation
between human languages. To this end, they are designed to be simple, to focus on
core technology issues, and to be fully supported. The 2002 task was to evaluate translation
from Chinese to English and from Arabic to English. Additional information about these
evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation
web site. *Scoring Tools* This evaluation kit includes a single Perl script (mteval-v09.pl)
that may be used to produce a translation quality score for one (or more) MT systems.
The script works by comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is based on finding sequences
of words in the reference translations that match word sequences in the system output
translation. More information on the evaluation algorithm may be obtained from the
paper detailing the algorithm: BLEU: a Method for Automatic Evaluation of Machine
Translation (Papineni et al., 2002). *Data* The Chinese-language source text included
in this corpus is a reorganization of data that was initially released to the public
as Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17). The Chinese-language reference
translations are a reorganized subset of data from the same MTC corpus. The Arabic-language
data (source text and reference translations) is a reorganized subset of data that
was initially released to the public as Multiple-Translation Arabic (MTA) Part 1 (LDC2003T18).
All source data for this corpus is newswire text. Chinese source text was drawn in
March and April 2002 from Xinhua News Agency and in March 2002 from Zaobao News Service
(sources indicated in docids). Arabic source text was drawn from the Xinhua News Agency's
Arabic newswire feed (October 2001, in the docid range: artb_500 - artb_565) and Agence
France-Presse (Feb. 1998 - Oct. 1999, in the docid range: artb_001 - artb_069). Arabic
Agence France-Presse source text was also released as part of Arabic Newswire Part
1 (LDC2001T55). For details on the methodology of the source data collection and production
of reference translations, see the documentation for the above-mentioned corpora.
For each language, the test set consists of two files, a source and a reference file.
Each reference file contains four independent translations of the data set. The evaluation
year, source language, test set (which, by default, is "evalset"), version of the
data, and source vs. reference file (with the latter being indicated by "-ref") are
reflected in the file name. DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted
test data until 2008 and XML-formatted test data thereafter. The files in this package
are provided in both formats.
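The word-sequence matching described under Scoring Tools can be sketched as a clipped n-gram precision computed against several reference translations. This is a simplified illustration of the idea behind BLEU, not a reimplementation of mteval-v09.pl: the function names are hypothetical, tokenization is plain whitespace splitting, and the brevity penalty and the combination across n-gram orders are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-token sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams found in the references.

    Each n-gram is credited at most as many times as it occurs in the
    single reference where it is most frequent.
    """
    hyp = ngrams(hypothesis, n)
    if not hyp:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return matched / sum(hyp.values())


if __name__ == "__main__":
    hyp = "the the the cat".split()
    refs = ["the cat sat".split(), "a cat is on the mat".split()]
    print(clipped_precision(hyp, refs, 1))
    print(clipped_precision(hyp, refs, 2))
```

Clipping is what keeps a system from being rewarded for repeating a common word: in the usage example, "the" appears three times in the hypothesis but is credited only once.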
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635499
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 812-641-991-447-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2003 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2003 Open Machine Translation (OpenMT) Evaluation is a package containing source
data, reference translations, and scoring software used in the NIST 2003 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected and developed
by LDC. The objective of the NIST OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation (MT) technologies
-- technologies that translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and fluent translation
of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES
(Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006
evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT.
These evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities in MT. The OpenMT evaluations are intended
to be of interest to all researchers working on the general problem of automatic translation
between human languages. To this end, they are designed to be simple, to focus on
core technology issues, and to be fully supported. The 2003 task was to evaluate translation
from Chinese to English and from Arabic to English. Additional information about these
evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation
web site. *Scoring Tools* This evaluation kit includes a single Perl script (mteval-v09c.pl)
that may be used to produce a translation quality score for one (or more) MT systems.
The script works by comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is based on finding sequences
of words in the reference translations that match word sequences in the system output
translation. More information on the evaluation algorithm may be obtained from the
paper detailing the algorithm: BLEU: a Method for Automatic Evaluation of Machine
Translation (Papineni et al., 2002). The included scoring script was released with
the original evaluation, intended for use with SGML-formatted data files, and is provided
to ensure compatibility of user scoring results with results from the original evaluation.
An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support,
additional options and bug fixes, documentation, and example translations, may be
downloaded from the NIST Multimodal Information Group Tools website. *Data* The Chinese-language
and Arabic-language source text included in this corpus is a reorganization of data
that was initially released to the public respectively as Multiple-Translation Chinese
(MTC) Part 4 (LDC2006T04) and Multiple-Translation Arabic (MTA) Part 2 (LDC2005T05).
The reference translations are a reorganized subset of data from these same Multiple-Translation
corpora. All source data for this corpus is newswire text collected in January and
February of 2003 from Agence France-Presse and Xinhua News Agency. For details on
the methodology of the source data collection and production of reference translations,
see the documentation for the above-mentioned corpora. For each language, the test
set consists of two files, a source and a reference file. Each reference file contains
four independent translations of the data set. The evaluation year, source language,
test set (which, by default, is "evalset"), version of the data, and source vs. reference
file (with the latter being indicated by "-ref") are reflected in the file name. DARPA
TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 2008 and
XML-formatted test data thereafter. The files in this package are provided in both
formats. *Sample* The sample text file contains excerpts from different XML files included
in this corpus, including reference translations and source text for a single newswire
document. The file is encoded in UTF-8.
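The word-sequence matching described under *Scoring Tools* can be illustrated with a minimal Python sketch. It is not the mteval Perl script and does not reproduce its tokenization or options; it only shows the clipped n-gram precision that BLEU builds on, and it omits the brevity penalty and the geometric mean over n-gram orders that the full metric applies. The function names and the toy sentences are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each candidate n-gram count is clipped to the largest count of that
    n-gram in any single reference translation.
    """
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Largest count of each n-gram over all reference translations.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical toy data: one system output and two reference translations.
system_output = "the cat sat on the mat".split()
reference_set = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

print(clipped_precision(system_output, reference_set, 1))  # unigram precision
print(clipped_precision(system_output, reference_set, 2))  # bigram precision
```

With four reference translations per test set, as in this package, the references list would simply hold four token lists per segment.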
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635502
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 081-787-574-125-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2004 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2004 Open Machine Translation (OpenMT) Evaluation is a package containing source
data, reference translations, and scoring software used in the NIST 2004 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected and developed
by LDC. The objective of the NIST OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation (MT) technologies
-- technologies that translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and fluent translation
of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES
(Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006
evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT.
These evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities in MT. The OpenMT evaluations are intended
to be of interest to all researchers working on the general problem of automatic translation
between human languages. To this end, they are designed to be simple, to focus on
core technology issues, and to be fully supported. The 2004 task was to evaluate translation
from Chinese to English and from Arabic to English. Additional information about these
evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation
web site. *Scoring Tools* This evaluation kit includes a single Perl script (mteval-v11a.pl)
that may be used to produce a translation quality score for one (or more) MT systems.
The script works by comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is based on finding sequences
of words in the reference translations that match word sequences in the system output
translation. More information on the evaluation algorithm may be obtained from the
paper detailing the algorithm: BLEU: a Method for Automatic Evaluation of Machine
Translation (Papineni et al., 2002). The included scoring script was released with
the original evaluation, intended for use with SGML-formatted data files, and is provided
to ensure compatibility of user scoring results with results from the original evaluation.
An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support,
additional options and bug fixes, documentation, and example translations, may be
downloaded from the NIST Multimodal Information Group Tools website. *Data* This corpus
consists of 150 Arabic newswire documents, 150 Chinese newswire documents, and 29
Chinese "prepared speech" documents, and a corresponding set of four separate human
expert reference translations. Because LDC lacks permission to publicly distribute
some of the source text used in the original evaluation, all 50 Arabic "prepared speech"
documents and 21 of 50 Chinese "prepared speech" documents (and their corresponding
reference translations) have been removed from the current release. The reference
translations included in this corpus have not previously been publicly available.
Some of the source text in this corpus has been publicly released as part of other
LDC publications, including Arabic Gigaword Second Edition, LDC2006T02 (Agence France-Presse
(AFP) and Xinhua News Agency (Xinhua)); Chinese Gigaword Second Edition, LDC2005T14
(Xinhua and Zaobao News Agency); Chinese Gigaword Third Edition, LDC2007T38 (AFP);
and Hong Kong Parallel Text, LDC2004T08 (Hong Kong Special Administrative Region).
The source text included in this corpus was collected from the following sources:
*Arabic*
DocID prefix  Source                                   Date                  Document count
AFA           Agence France-Presse                     Jan. 2004             50
ALH           Al Hayat                                 Jan.-Mar. 2004        25
ANN           An Nahar                                 Feb. 2004-Mar. 2004   25
XIN           Xinhua News Agency                       Jan. 2004             50
*Chinese*
DocID prefix  Source                                   Date                  Document count
AFC           Agence France-Presse                     Jan. 2004             50
HKN           Hong Kong Special Administrative Region  Jan.-Mar. 2003        16
PD            People's Daily                           Apr. 2003-Mar. 2004   34
XIN           Xinhua News Agency                       Oct. 2002-Jan. 2004   53
ZBN           Zao Bao News Agency                      Sept. 2003-Mar. 2004  26
For each language, the test set consists of
two files: a source and a reference file. Each reference file contains four independent
translations of the data set. The evaluation year, source language, test set (which,
by default, is "evalset"), version of the data, and source vs. reference file (with
the latter being indicated by "-ref") are reflected in the file name. DARPA TIDES
MT and NIST OpenMT evaluations used SGML-formatted test data until 2008 and XML-formatted
test data thereafter. The files in this package are provided in both formats. *Sample*
The sample text file contains excerpts from different XML files included in this corpus,
including reference translations and source text for a single newswire document. The
file is encoded in UTF-8.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635510
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 778-679-274-442-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRECVID 2004 Keyframes & Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TRECVID 2004 Keyframes and Transcripts was developed as a collaborative effort between
researchers at LDC, NIST, LIMSI-CNRS, and Dublin City University. TREC Video Retrieval
Evaluation (TRECVID) is sponsored by the National Institute of Standards and Technology
(NIST) to promote progress in content-based retrieval from digital video via open,
metrics-based evaluation. The keyframes in this release were extracted for use in
the NIST TRECVID 2004 Evaluation. TRECVID is a laboratory-style evaluation that attempts
to model real world situations or significant component tasks involved in such situations.
In 2004 there were four main tasks with associated tests:
* shot boundary determination
* story segmentation
* high-level feature extraction
* search (interactive and manual)
For a detailed description of the TRECVID Evaluation Tasks, please refer to the NIST
TRECVID 2004 Evaluation Description. *Data* The source data includes approximately
70 hours of English language broadcast programming collected by LDC in 1998 from ABC
("World News Tonight") and CNN ("CNN Headline News"). Shots are fundamental units
of video, useful for higher-level processing. To create the master list of shots,
the video was segmented. The results of this pass are called subshots. Because the
master shot reference is designed for use in manual assessment, a second pass over
the segmentation was made to create the master shots of at least 2 seconds in length.
These master shots are the ones used in submitting results for the feature and search
tasks in the evaluation. In the second pass, starting at the beginning of each file,
the subshots were aggregated, if necessary, until the current shot was at least 2
seconds in duration, at which point the aggregation began anew with the next subshot.
The keyframes were selected by going to the middle frame of the shot boundary, then
parsing left and right of that frame to locate the nearest I-Frame. This then became
the keyframe and was extracted. Keyframes have been provided at both the subshot (NRKF)
and master shot (RKF) levels. In a small number of cases (all of them subshots) there
was no I-Frame within the subshot boundaries. When this occurred, the middle frame
was selected. There is one anomaly: at the end of the first video in the test collection,
a subshot occurs outside a master shot. The emphasis in the common shot boundary
reference is on the shots, not the transitions. The shots are contiguous. There are
no gaps between them. They do not overlap. The media time format is based on the Gregorian
day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions
of a second. *Sample* Samples of data available in this corpus: keyframe (video still),
shots metadata (mp7 markup), subshot metadata, transcript, and tokenized transcript.
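The second-pass aggregation and the keyframe selection described under *Data* can be sketched in Python as follows. This is an illustration under stated assumptions, not the tooling used for the evaluation: subshots are represented as hypothetical (start, end) pairs in seconds, frames as integer indices, and the handling of a short trailing run of subshots and of ties between equally distant I-frames is assumed, since the summary does not specify either.

```python
def aggregate_subshots(subshots, min_duration=2.0):
    """Merge consecutive subshots into master shots of at least min_duration.

    subshots: ordered, contiguous (start_seconds, end_seconds) pairs.
    Returns a list of (start, end, member_subshots) tuples.
    """
    masters = []
    current = []
    for start, end in subshots:
        current.append((start, end))
        # Close the master shot once it spans at least min_duration seconds.
        if current[-1][1] - current[0][0] >= min_duration:
            masters.append((current[0][0], current[-1][1], current))
            current = []
    if current:
        # Assumption: a trailing run shorter than the threshold is kept
        # as a final short shot; the source does not describe this case.
        masters.append((current[0][0], current[-1][1], current))
    return masters


def pick_keyframe(first_frame, last_frame, iframes):
    """Pick the I-frame nearest the middle frame, else the middle frame."""
    middle = (first_frame + last_frame) // 2
    inside = [f for f in iframes if first_frame <= f <= last_frame]
    if not inside:
        return middle
    # Assumption: ties are broken in favour of the earlier frame.
    return min(inside, key=lambda f: (abs(f - middle), f))


# Hypothetical subshot boundaries in seconds.
subshots = [(0.0, 0.8), (0.8, 1.5), (1.5, 3.1), (3.1, 6.0), (6.0, 6.9)]
masters = aggregate_subshots(subshots)
for start, end, members in masters:
    print(start, end, len(members))

print(pick_keyframe(100, 160, [90, 120, 150]))
print(pick_keyframe(100, 110, [90, 120]))
```

Because each master shot is built only from whole subshots, the resulting shots stay contiguous and non-overlapping, matching the properties stated for the common shot boundary reference.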
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Television broadcasting of news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Over, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Quenot, Georges
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635553
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 898-935-705-624-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 was developed by
researchers at LDC. SAMA 3.1 is based on, and updates, Buckwalter Arabic Morphological
Analyzer (BAMA) 2.0 (LDC2004L02), which was developed by Tim Buckwalter. Since this
is the first public release of SAMA, it has been numbered continuously to reflect
the continuity between this release and previous BAMA releases. SAMA 3.1 is a software
tool for the morphological analysis of Standard Arabic. SAMA 3.1 considers each Arabic
word token in all possible prefix-stem-suffix segmentations, and lists all known/possible
annotation solutions, with assignment of all diacritic marks, morpheme boundaries
(separating clitics and inflectional morphemes from stems), and all Part-of-Speech
(POS) labels and glosses for each morpheme segment. The generated output may then
be reviewed by users, and the most appropriate annotation selected from among several
choices. The software layer of SAMA 3.1 relies on a data layer that consists primarily
of three Arabic-English lexicon files: prefixes (1328 entries), suffixes (945 entries),
and stems (79318 entries representing 40654 lemmas). The lexicons are supplemented
by three morphological compatibility tables used for controlling prefix-stem combinations
(2497 entries), stem-suffix combinations (1632 entries), and prefix-suffix combinations
(1180 entries). *Differences since BAMA 2.0* The input format, output format, and
data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental
changes to the data layer in SAMA have resulted in:
* increased lexicon coverage in the dictionary files
* important changes and additions to the inventory of POS tags
* more possible solutions generated for numerous word forms
Data-layer changes are
summarized in more detail in the table_updates*.txt documentation files included in
the corpus documentation. The software implementation has been updated to allow more
input/output options, installation and configuration options, and smoother incorporation
in other Perl tools/services. The structure of the dictionary and morphotactic tables
has remained the same (the tables provided with SAMA 3.1 differ from the BAMA 2.0
tables only in size and content, not in format). Logical separation between the software
layer and data layer allows the new software tools to be used with previous versions
of the tables (instructions are provided with software documentation). The basic logic
that implements the segmentation and analysis look-up for Arabic words is essentially
unchanged since BAMA 2.0. The perldoc documentation for the SAMA.pm Perl module gives
a full account of the tokenization logic. The data layer is now accessed through Berkeley
DB, with result-caching enabled by default, leading to improved performance. Various
utility scripts have also been added to the software package to facilitate more flexible
interaction with tools and data. UTF-8 is now the default input/output and internal
character encoding, with automatic conversion of different input encodings (cp1256,
iso-8859-6, and Buckwalter transliteration are also accepted). With this change, the
use of UTF-8 as input is now fully supported, eliminating a range of problems that
would result from having to convert to cp1256 for analysis. Full details about input/output
options are provided in the SAMA.pm documentation. Further details on changes in software
options and implementation may be found in the perldoc software tool documentation,
and in the Changes*.txt documentation files. *Dependencies* There are two dependencies
for installing and using SAMA 3.1: the DB_File.pm module (available from CPAN), and
Encode::Buckwalter (included with the SAMA 3.1 distribution). The DB_File module in
turn requires that the Berkeley DB libraries be present.
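The prefix-stem-suffix lookup described above can be illustrated with a minimal Python sketch. It is not the SAMA.pm implementation and does not use the SAMA lexicon format: the lexicon entries, the category labels and the function names are invented for illustration, and an English-like toy word stands in for Arabic input. The sketch only shows the general idea of enumerating every segmentation and keeping those whose three parts are found in the lexicons and are permitted by all three compatibility tables.

```python
from itertools import product

# Hypothetical toy lexicons: surface form -> list of category labels.
PREFIXES = {"": ["P0"], "un": ["P_NEG"]}
STEMS = {"lock": ["S_VERB"], "unlock": ["S_VERB"], "locked": ["S_ADJ"]}
SUFFIXES = {"": ["X0"], "ed": ["X_PAST"]}

# Hypothetical compatibility tables: allowed pairs of category labels.
PREFIX_STEM = {("P0", "S_VERB"), ("P_NEG", "S_VERB"), ("P0", "S_ADJ")}
STEM_SUFFIX = {("S_VERB", "X0"), ("S_VERB", "X_PAST"), ("S_ADJ", "X0")}
PREFIX_SUFFIX = {("P0", "X0"), ("P0", "X_PAST"),
                 ("P_NEG", "X0"), ("P_NEG", "X_PAST")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):                 # prefix is word[:i]
        for j in range(i + 1, n + 1):  # stem is word[i:j]
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return the segmentations licensed by the lexicons and tables."""
    solutions = []
    for prefix, stem, suffix in segmentations(word):
        if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
            continue
        for pc, sc, xc in product(PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]):
            if ((pc, sc) in PREFIX_STEM
                    and (sc, xc) in STEM_SUFFIX
                    and (pc, xc) in PREFIX_SUFFIX):
                solutions.append((prefix, stem, suffix))
    return solutions


print(analyze("unlocked"))
```

In this toy setup the split "un" + "locked" is found in the lexicons but is rejected because the prefix-stem table does not allow that pair of categories, which is the role the compatibility tables play in filtering candidate analyses.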
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bouziri, Basma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635537
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 512-715-458-848-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 1 v 4.1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 1 (ATB1) v 4.1 was developed at the Linguistic Data Consortium
(LDC). It consists of 734 newswire stories from Agence France-Presse (AFP) with part-of-speech
(POS), morphology, gloss and syntactic treebank annotation in accordance with the
Penn Arabic Treebank (PATB) Guidelines developed in 2008 and 2009. This release represents
a significant revision of LDC's previous ATB1 publications: Arabic Treebank: Part 1
v 2.0 LDC2003T06 and Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic
analysis) LDC2005T02. The ongoing PATB project supports research in Arabic-language
natural language processing and human language technology development. The methodology
and work leading to the release of this publication are described in detail in the
documentation accompanying this corpus and in two research papers: Enhancing the Arabic
Treebank: A Collaborative Effort toward New Annotation Guidelines and Consistent and
Flexible Integration of Morphological Annotation in the Arabic Treebank. *Data* ATB1
v 4.1 contains a total of 145,386 tokens before clitics are split, and 167,280 tokens
after clitics are separated for the treebank annotation. *Sponsorship* This work was
supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gaddeche, Fatma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mekki, Wigdan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bouziri, Basma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zaghouani, Wajdi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635545
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010V02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 347-638-481-141-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRECVID 2006 Keyframes
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010V02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TRECVID 2006 Keyframes, Linguistic Data Consortium (LDC) catalog id LDC2010V02 and
ISBN 1-58563-554-5, was developed as a collaborative effort between researchers at
LDC, NIST, LIMSI-CNRS, and Dublin City University. TREC Video Retrieval Evaluation
(TRECVID) is sponsored by the National Institute of Standards and Technology (NIST)
to promote progress in content-based retrieval from digital video via open, metrics-based
evaluation. The keyframes in this release were extracted for use in the NIST TRECVID
2006 Evaluation. TRECVID is a laboratory-style evaluation that attempts to model real
world situations or significant component tasks involved in such situations. In 2006
TRECVID completed a 2-year cycle on English, Arabic, and Chinese news video. The evaluation
consisted of three system tasks and associated tests: * shot boundary determination
* high-level feature extraction * search (interactive, manually-assisted, and/or fully
automatic) The 2006 evaluation also included a rushes exploitation exploratory task,
but the material associated with that task is not included in this release. For a
detailed description of the TRECVID Evaluation Tasks, please refer to the NIST TRECVID
2006 Evaluation Description. *Data* The video stills that compose this corpus are
drawn from approximately 158.6 hours of English, Arabic, and Chinese language broadcast
programming data collected by LDC from NBC ("NBC Nightly News"), CNN ("Live From..",
"Anderson Cooper 360"), MSNBC ("MSNBC News live"), New Tang Dynasty TV ("Economic
Frontier", "Focus Interactive"), Phoenix TV ("Good Morning China"), Lebanese Broadcasting
Corp. ("Naharkum Saiid", "News on LBC"), Alhurra TV ("Alhurra News") and China Central
TV ("CCTV_News"). Shots are fundamental units of video, useful for higher-level processing.
To create the master list of shots, the video was segmented. The results of this pass
are called subshots. Because the master shot reference is designed for use in manual
assessment, a second pass over the segmentation was made to create the master shots
of at least 2 seconds in length. These master shots are the ones to be used in submitting
results for the feature and search tasks. In the second pass, starting at the beginning
of each file, the subshots were aggregated, if necessary, until the current shot
was at least 2 seconds in duration, at which point the aggregation began anew with
the next subshot. The keyframes were selected by going to the middle frame of the
shot boundary, then parsing left and right of that frame to locate the nearest I-Frame.
This then became the keyframe and was extracted. Keyframes have been provided at both
the subshot (NRKF) and master shot (RKF) levels. In a small number of cases (all of
them subshots) there was no I-Frame within the subshot boundaries. When this occurred,
the middle frame was selected. The emphasis in the common shot boundary reference
is on the shots, not the transitions. The shots are contiguous. There are no gaps
between them. They do not overlap. The media time format is based on the Gregorian
day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions
of a second. In our case, the frame rate will likely be 29.97. One fraction of a second
is thus specified as "PT1001N30000F". The video id has the format of "XXX" and shot
id "shotXXX_YYY". The "XXX" is the sequence number of video onto which the video file
name is mapped; this mapping is listed in the "collection.xml" file. The "YYY" is the
sequence number of the shot. Keyframes are identified by a suffix: "_RKF" for the
main keyframe (one per shot) or "_NRKF" for additional keyframes derived from subshots
that were merged so that shots have a minimum duration of 2 seconds. *Sample* Samples
of data available in this corpus: Keyframe (video still) Shots metadata (mp7 markup)
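The second-pass aggregation and keyframe selection described in the summary above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tooling used for the evaluation: the function names, the representation of subshots as (start_frame, end_frame) pairs with an exclusive end, and the list of I-Frame positions are all hypothetical. The summary does not say how a short remainder at the end of a file was handled; here it is merged into the preceding master shot. The frame duration of 1001/30000 seconds is taken from the "PT1001N30000F" unit.

```python
from fractions import Fraction

# One frame at 29.97 fps, as given by the "PT1001N30000F" unit.
FRAME_DURATION = Fraction(1001, 30000)
MIN_SHOT_SECONDS = 2


def aggregate_subshots(subshots, min_seconds=MIN_SHOT_SECONDS):
    """Merge contiguous subshots into master shots of at least min_seconds.

    subshots: list of (start_frame, end_frame) pairs, end exclusive.
    """
    masters = []
    current_start = None
    for start, end in subshots:
        if current_start is None:
            current_start = start
        # Close the master shot once it is long enough, then start anew.
        if (end - current_start) * FRAME_DURATION >= min_seconds:
            masters.append((current_start, end))
            current_start = None
    # Assumption: a short leftover tail joins the previous master shot.
    if current_start is not None:
        tail_end = subshots[-1][1]
        if masters:
            masters[-1] = (masters[-1][0], tail_end)
        else:
            masters.append((current_start, tail_end))
    return masters


def middle_frame(shot):
    """Return the frame index halfway through a (start, end) shot."""
    start, end = shot
    return start + (end - start) // 2


def select_keyframe(shot, iframe_positions):
    """Pick the I-Frame nearest the middle frame, else the middle frame."""
    start, end = shot
    mid = middle_frame(shot)
    candidates = [f for f in iframe_positions if start <= f < end]
    if not candidates:
        # No I-Frame inside the shot boundaries: fall back to the middle.
        return mid
    return min(candidates, key=lambda f: abs(f - mid))
```

Using exact fractions avoids rounding surprises at the 2-second boundary: 60 frames last 2.002 seconds and qualify, while 59 frames do not.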
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Television broadcasting of news
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Over, Paul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Quenot, Georges
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010V02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 034-349-578-913-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Asian Elephant Vocalizations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Asian Elephant Vocalizations, Linguistic Data Consortium (LDC) catalog number LDC2010S05
and ISBN 1-58563-557-X, consists of 57.5 hours of audio recordings of vocalizations
by Asian Elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of
which 31.25 hours have been annotated. Voice recording field notes were made by Shermin
de Silva and Ashoka Ranjeewa, of the Uda Walawe Elephant Research Project. The collection
and annotation of the recordings was conducted and overseen by Shermin de Silva, through
the University of Pennsylvania Department of Biology, and Institute for Research in
Cognitive Science. The recordings primarily feature adult female and juvenile elephants.
Existing knowledge of acoustic communication in elephants is based mostly on African
species (Loxodonta africana and Loxodonta cyclotis). There has been comparatively
less study of communication in Asian elephants, primarily because the habitat in which
Asian elephants typically live makes them more difficult to study than African forest
elephants. For other current elephant vocalization research, see ElephantVoices and
the Cornell Lab of Ornithology's Elephant Listening Project. This corpus is intended
to enable researchers in acoustic communication to evaluate acoustic features and
repertoire diversity of the recorded population. Of particular interest is whether
there may be regional dialects that differ among Asian elephant populations in the
wild and in captivity. A second interest is in whether structural commonalities exist
between this and other species that shed light on underlying social and ecological
factors shaping communication systems. *Methods* *Study site and subjects* Uda Walawe
National Park (UWNP), Sri Lanka, is located at latitude 6°30'14.0646"N, longitude 80°54'28.1268"E,
and an average altitude of 118 m above sea level. It occupies 308 km² and contains
tall grassland, dense scrub, riparian forest, secondary forest, rivers and seasonal
streams. It also contains several natural and man-made water sources and reservoirs
with seasonal floodplains. There are two monsoons per calendar year, separated by
dry seasons of variable length. Over 300 adult females have been individually identified
in UWNP using characteristics of the ears, tail, and other natural markings (Moss,
1996). *Data collection* Data were collected from May 2006 to December 2007. Observations
were performed by vehicle during park hours from 0600 to 1830 h. Most recordings of
vocalizations were made using an Earthworks QTC50 microphone shock-mounted inside
a Rycote Zeppelin windshield, via a Fostex FR-2 field recorder (24-bit sample size,
sampling rate 48 kHz) connected to a 12 V lead acid battery. Recordings were initiated
at the start of a call with a 10-s pre-record buffer so that the entire call was captured
and loss of rare vocalizations minimized. This was made possible with the pre-record
feature of the Fostex, which records continuously, but only saves the file with a
10-second lead once the record button is depressed. In order to minimize loss of low-frequency
or potentially inaudible calls, recording was continued for at least three minutes
following the end of vocalization events. During the first two months, hour-long recording
sessions were also carried out opportunistically while in close proximity to a group.
However, spectrograms showed that few vocalizations were captured; therefore, this
was discontinued. *Anomalies* Some audio files have 1 channel (field recording) and
some have 2 channels (field recordings and field notes). Certain files were recorded
at 22050 Hz sample rate:
* asian_elephant_voc_d1/data/20070209/B13h00m34s09feb2007y.flac
* asian_elephant_voc_d1/data/20070209/B13h10m04s09feb2007y.flac
* asian_elephant_voc_d2/data/20070405/B14h56m48s05apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h35m11s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h38m34s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h39m27s09apr2007y.flac
Certain files were recorded at 16 bits per sample:
* asian_elephant_voc_d1/data/20070209/B13h00m34s09feb2007y.flac
* asian_elephant_voc_d1/data/20070209/B13h10m04s09feb2007y.flac
* asian_elephant_voc_d2/data/20070405/B14h56m48s05apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h35m11s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h38m34s09apr2007y.flac
* asian_elephant_voc_d2/data/20070409/B14h39m27s09apr2007y.flac
* asian_elephant_voc_d3/data/20070507/B08h37m21s07may2007y.flac
* asian_elephant_voc_d4/data/20070822/B08h44m02s22aug2007y.flac
* asian_elephant_voc_d4/data/20070822/B08h48m02s22aug2007y.flac
* asian_elephant_voc_d4/data/20071015/B12h25m22s15oct2007y.flac
* asian_elephant_voc_d4/data/20071015/B12h59m51s15oct2007y.flac
* asian_elephant_voc_d5/data/20071024/B16h12m29s24oct2007y.flac
One file contains audio extracted from a video recording at 16-bit, 32 kHz. This file
may overlap with other audio recordings, but was used to aid annotation because of the
density of vocalizations and the number of vocalizing individuals:
* asian_elephant_voc_d1/data_from_video/20070724/20070724_g01_vocs.flac
*Audio data annotation* Certain audio files were manually annotated, to the extent
possible, with call type (see below for a list of categories), caller id, and miscellaneous
notes. Annotations were made using the Praat TextGrid Editor, which allows spectral
analysis and annotation of audio files with overlapping events. Annotations were based
on written and audio-recorded field notes, and in some cases video recordings. Miscellaneous
notes are free-form, and include such information as distance from source, caller
identity certainty, and accompanying behavior. Audio files that are included without
a corresponding Praat TextGrid annotation file have not yet been annotated. *Acoustic
features* There are three main categories of vocalizations: those that show clear
fundamental frequencies (periodic), those that do not (a-periodic), and those that
show periodic and a-periodic regions as at least two distinct segments. Calls were
identified as belonging to one of 14 categories:
Call Type             Abbreviation
Growl                 GRW
Squeak                SQK
Longroar-rumble       LRM
Longroar              LRR
Rumble                RUM
Bark-rumble           BRM
Trumpet               TMP
Roar-rumble           RRM
Roar                  ROR
Bark                  BRK
Squeal                SQL
Croak-rumble          CRM1
Chirp-rumble          CRM2
Musth chirp-rumble    MCR
*Audio compression (FLAC)* All audio wav files in this corpus have
been compressed using FLAC (Free Lossless Audio Codec). Because FLAC is a lossless
compression algorithm, the conversion of the included FLAC files into wav files will
result in files that are sample-for-sample identical to the original wav file recordings.
Many standard audio tools (including Praat TextGrid Editor) will transparently decompress
FLAC files, so that they may be played, processed, and examined as if they were uncompressed
audio. Should you wish to explicitly decompress FLAC files (by converting them into
wav files), there are many free audio tools capable of performing this conversion.
Some such tools, available for all major operating systems, may be found at http://flac.sourceforge.net/download.html
The data in this corpus were used by the corpus author as the foundation of a paper,
Acoustic communication in the Asian elephant, Elephas maximus maximus (S. de Silva,
Behaviour, Volume 147, Number 7, 2010, pp. 825-852). If you have trouble accessing
the paper through the preceding link, you may contact the corpus author directly for
assistance. *Sample* A sample of data available in this corpus: Audio recording Praat
TextGrid Annotation
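The audio file names listed under *Anomalies* appear to encode a time and date (for example, "B13h00m34s09feb2007y.flac" sits in the directory "20070209"). The sketch below, in Python, parses such names into a timestamp. It rests on assumptions not confirmed by the summary: that the pattern is B, hours, minutes, seconds, day, three-letter month, year, and that the value marks when the recording was saved. The function name `parse_recording_time` is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern: B<HH>h<MM>m<SS>s<DD><mon><YYYY>y.flac
_NAME = re.compile(
    r"B(\d{2})h(\d{2})m(\d{2})s(\d{2})([a-z]{3})(\d{4})y\.flac$"
)
_MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_recording_time(path):
    """Return a datetime parsed from the file name, or None if it does not fit."""
    match = _NAME.search(path)
    if match is None:
        return None
    hour, minute, second, day, month, year = match.groups()
    if month not in _MONTHS:
        return None
    return datetime(
        int(year), _MONTHS[month], int(day),
        int(hour), int(minute), int(second),
    )
```

Names that do not follow the assumed pattern, such as the file extracted from video, yield None rather than a guessed value.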
LANGUAGE NOTE
- Language note:
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Asiatic elephant
- Form subdivision:
Databases.
- General subdivision:
Vocalization
- Geographic subdivision:
Sri Lanka
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Elephants
- Form subdivision:
Databases.
- General subdivision:
Vocalization
- Geographic subdivision:
Sri Lanka
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Elephants
- Form subdivision:
Databases.
- General subdivision:
Vocalization
- Geographic subdivision:
Asia
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Animal sounds
- Form subdivision:
Databases.
- Geographic subdivision:
Asia
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silva, Shermin de
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635561
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 048-978-532-143-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2005 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2005 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium
(LDC) catalog number LDC2010T14 and ISBN 1-58563-556-1, is a package containing source
data, reference translations, and scoring software used in the NIST 2005 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected and developed
by LDC. The objective of the NIST OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation (MT) technologies
-- technologies that translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and fluent translation
of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES
(Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006
evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT.
These evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities in MT. The OpenMT evaluations are intended
to be of interest to all researchers working on the general problem of automatic translation
between human languages. To this end, they are designed to be simple, to focus on
core technology issues, and to be fully supported. The 2005 task was to evaluate translation
from Chinese to English and from Arabic to English. Additional information about these
evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation
web site. *Scoring Tools* This evaluation kit includes a single Perl script (mteval-v11b.pl)
that may be used to produce a translation quality score for one (or more) MT systems.
The script works by comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is based on finding sequences
of words in the reference translations that match word sequences in the system output
translation. More information on the evaluation algorithm may be obtained from the
paper detailing the algorithm: BLEU: a Method for Automatic Evaluation of Machine
Translation (Papineni et al., 2002). The included scoring script was released with
the original evaluation, intended for use with SGML-formatted data files, and is provided
to ensure compatibility of user scoring results with results from the original evaluation.
An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support,
additional options and bug fixes, documentation, and example translations, may be
downloaded from the NIST Multimodal Information Group Tools website. *Data* This corpus
consists of 100 Arabic newswire documents, 100 Chinese newswire documents, and a corresponding
set of four separate human expert reference translations. Source text for both languages
was collected from Agence France-Presse and Xinhua News Agency in December 2004 and
January 2005. The reference translations included in this corpus have not previously
been publicly available. Arabic source text from December 2004 has been available
in LDC's Arabic Gigaword releases beginning with the Second Edition (LDC2006T02),
and from January 2005 beginning with the Third Edition (LDC2007T40). Chinese source
text from Xinhua December 2004 has been available in LDC's Chinese Gigaword releases
beginning with the Second Edition (LDC2005T14), and from Xinhua January 2005 and AFP
beginning with the Third Edition (LDC2007T38). For each language, the test set consists
of two files: a source and a reference file. Each reference file contains four independent
translations of the data set. The evaluation year, source language, test set (which,
by default, is "evalset"), version of the data, and source vs. reference file (with
the latter being indicated by "-ref") are reflected in the file name. DARPA TIDES
MT and NIST OpenMT evaluations used SGML-formatted test data until 2008 and XML-formatted
test data thereafter. The files in this package are provided in both formats. *Sample*
Sample text file containing excerpts from different xml files included in this corpus,
including reference translations and source text for a single newswire document. The
file is encoded in UTF-8.
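The word-sequence matching described under *Scoring Tools* can be illustrated with a small Python sketch of clipped n-gram precision, the core quantity in the BLEU paper. This is a simplified illustration, not a reimplementation of mteval-v11b.pl: it assumes whitespace tokenization, omits the brevity penalty and the combination across n-gram orders, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-token sequence in a token list."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Each n-gram's count is clipped to the largest count found in any
    single reference, so repeating a matching word is not rewarded.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(
        min(count, max_ref[gram]) for gram, count in hyp_counts.items()
    )
    return matched / sum(hyp_counts.values())
```

With several reference translations, as in this corpus, a hypothesis n-gram counts as matched if any one reference contains it, which is why multiple independent references make the score less sensitive to legitimate variation in wording.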
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635596
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 042-211-152-679-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of text/sound track or separate title:
afb
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Asian Spoken Language Sampler
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes
a wide and growing assortment of resources for researchers, engineers and educators
whose work is concerned with human languages. Historically, most linguistic resources
were not generally available to interested researchers but were restricted to single
laboratories or to a limited number of users. Inspired by the success of selected,
readily available and well-known data sets, such as the Brown University text corpus,
LDC was founded in 1992 to provide a new mechanism for large-scale corpus development
and sharing of resources. With the support of its members, LDC is able to provide
critical services to the language research community. These services include: maintaining
the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or
web downloads, negotiating intellectual property agreements with data providers and
maintaining relations with other like-minded groups around the world. Resources available
from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons
in multiple languages, as well as software tools to facilitate the use of corpus materials.
For a complete view of LDC's publications, a searchable catalog is available at http://www.ldc.upenn.edu/Catalog/.
*Data* The Asian Spoken Language Sampler provides a variety of speech and transcript
samples from various corpora and is designed to illustrate the variety and breadth
of the speech-related resources available from LDC's Catalog. Further information about
each data set can be obtained by clicking the links in the table below. The sample
files provided in this release have been modified in various ways relative to the
original data as published by LDC: * most excerpts are truncated to be much shorter
than the original files; excerpt duration is typically one minute and thirty seconds
* signal amplitude has been adjusted where necessary to normalize playback volume
* some corpora are published in compressed form, but all samples here are uncompressed
* LDC frequently uses NIST SPHERE file format for audio data, but the audio files
in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility
with typical browser audio utilities. 2005 NIST Language Recognition Evaluation The
goal of the NIST Language Recognition Evaluation is to establish the baseline of current
performance capability for language recognition of conversational telephone speech
and to lay the groundwork for further research efforts in the field. 2007 NIST Language
Recognition Evaluation Test Set The most significant differences between previous
NIST evaluations and the 2007 task were the increased number of languages and dialects,
the greater emphasis on a basic detection task for evaluation and the variety of
evaluation conditions. ARL Urdu Speech Database, Training Data The ARL Urdu Speech
Database is a collection of recorded speech from 200 adult native Urdu speakers from
Pakistan and Northern India. CALLFRIEND Farsi A corpus of 60 unscripted telephone
calls between friends and acquaintances speaking in their native language, Farsi.
CALLFRIEND Tamil A corpus of 60 unscripted telephone calls between friends and acquaintances
speaking in their native language, Tamil. CALLFRIEND Vietnamese A corpus of 60 unscripted
telephone calls between friends and acquaintances speaking in their native language,
Vietnamese. CALLHOME Japanese A corpus of 120 unscripted telephone conversations between
native Japanese speakers and a corpus of associated transcripts. CALLHOME Mandarin
Chinese Speech The Callhome Mandarin Chinese corpus of telephone speech consists of
120 unscripted telephone conversations between native speakers of Mandarin Chinese.
JEIDA/JCSD-Channel 0 Mono Syllables This collection consists of high-fidelity recordings
of 150 native speakers of Japanese; each speaker produces four repetitions of 323 short
prompts, including city names, control words, monosyllabic words, isolated digits
and strings of four digits. Each reading session was recorded with two microphones.
Korean Telephone Conversations Speech and Transcripts This publication consists of
100 telephone conversations, 49 of which were published in 1996 as Callfriend Korean,
while the remaining 51 are previously unexposed calls. All 100 conversations have been
transcribed. Mandarin Affective Speech Mandarin Affective Speech is a database of
emotional speech consisting of audio recordings and corresponding transcripts collected
in 2005 at the Advance Computing and System Laboratory, Zhejiang University. The speech
database was recorded by eliciting speakers to express different emotional states
in response to stimuli. Russian through Switched Telephone Network (RuSTeN) The purpose
of the project was to develop software for automatic identification of speakers based
on voice samples acquired through telephone channels. TDT4 Multilingual Broadcast
News Speech Corpus This release contains the complete set of American English, Modern
Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003
Topic Detection and Tracking technology evaluations. West Point Korean Speech West
Point Korean Speech is a database of digital recordings of spoken Korean. The prompt
scripts were created from 20,000 distinct sentences, along with a subset of prompts
designed to elicit free response answers to questions for use in domain-specific translation
systems. Fisher Levantine Arabic A collection of 279 Levantine Arabic telephone conversations
and transcripts from speakers of several nationalities. Gulf Arabic Conversational
Telephone Speech Contains 975 telephone conversations from speakers across the Persian
Gulf region and their transcriptions. *How to Obtain the Sampler* The Asian Spoken
Language Sampler may be downloaded freely. The sampler is a gzipped tar file (28 MB). Most
compression utilities will readily extract the sampler.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Vietnamese, Urdu, Tamil, Russian, Korean, Japanese, Hindi,
Persian, Mandarin Chinese, North Levantine Arabic, South Levantine Arabic, Gulf Arabic,
Dari, and Iranian Persian. Documentation in English.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 895-206-642-518-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Message Understanding Conference 7 Timed (MUC7_T)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data Consortium (LDC)
catalog number LDC2010T15 and isbn 1-58563-560-X, was developed by researchers at
Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität
Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data
Consortium, LDC2001T02), which consists of New York Times news stories annotated for
use in the Message Understanding Conference 7 (MUC7) evaluation. The series of MUC
evaluations in the 1990s focused on emerging information extraction technologies.
Further information about NIST's MUC7 evaluation can be found on the MUC project website.
MUC7_T consists of 100 articles from the MUC7 corpus training set reannotated for
named entities (persons, locations and organizations) with a time stamp indicating
the time measured for the linguistic decision making process. The corpus was developed
for two principal purposes: for use in evaluations of selective sampling strategies,
such as Active Learning; and to create predictive models for annotation costs. The
annotation was performed by two advanced students of linguistics with good English
language skills who followed the original guidelines of the MUC7 named entity
task (which can be found in the online documentation for the MUC7 corpus). *Data*
The data is stored in XML format. There is an element anno_example for each annotation
example that has the original MUC7 document as text context. The MUC7 document was
tokenized using the Stanford Tokenizer, with white spaces marking token boundaries.
The tokenizer is part of the Stanford Parser package which can be obtained from The
Stanford Natural Language Processing Group. The following attributes are used for
the element anno_example:
Attribute: Explanation
anno_time: The time it took to annotate the annotation unit of this annotation example (time in milliseconds).
anno_unit_tokens: All tokens of the annotation unit.
anno_unit_offset: Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.
anno_unit_labels: Labels for the tokens of the annotation unit (these labels are taken from MUC7).
doc_id: ID of the document of the annotation example.
sent_id: ID of the sentence of the annotation example.
anno_unit_id: ID of the unit of the annotation example.
muc7_org_filename: The name of the original MUC7 document from which this annotation example is taken.
*Directory Structure* The directory structure of the corpus is as follows: data: This
subdirectory contains the MUC7_T data; the data for annotator A and B are in separate
folders. For each annotator, there is a version of MUC7_T with CNP-level and with
sentence-level annotations. docs: This subdirectory contains detailed documentation
as well as publications describing applications of MUC7_T. There is also a small JavaDoc
for the Java tools (see the tools subdirectory below). dtd: This subdirectory contains
the Document Type Definition (DTD) for the data files. tools: This subdirectory contains
a small Java API which allows users to read the MUC7_T XML data so that each annotation
example is represented by a Java object. The API includes the source code and a jar
package. The source code has been tested with Java 1.5 and Java 1.6.
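The attribute list above can be illustrated with a minimal Python sketch that reads anno_example elements and computes the average annotation time per token. It is not the Java API shipped in the tools subdirectory. The attribute names come from the description above; the root element name (corpus), the assumption that tokens and labels are whitespace-separated inside XML attributes, and all sample values are hypothetical, so the DTD in the dtd subdirectory should be consulted for the real layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only the attribute names are taken from the
# record; the root element, value encodings and values are assumptions.
SAMPLE = """
<corpus>
  <anno_example anno_time="4200" anno_unit_tokens="New York Times"
                anno_unit_offset="0 1 2" anno_unit_labels="ORG ORG ORG"
                doc_id="d1" sent_id="s1" anno_unit_id="u1"
                muc7_org_filename="example1.sgm">New York Times reported</anno_example>
  <anno_example anno_time="1800" anno_unit_tokens="Jena"
                anno_unit_offset="3" anno_unit_labels="LOC"
                doc_id="d1" sent_id="s2" anno_unit_id="u2"
                muc7_org_filename="example1.sgm">He moved to Jena</anno_example>
</corpus>
"""


def read_examples(xml_text):
    """Return one dict per anno_example element found in xml_text."""
    root = ET.fromstring(xml_text)
    examples = []
    for el in root.iter("anno_example"):
        examples.append({
            "anno_time_ms": int(el.get("anno_time")),
            "tokens": el.get("anno_unit_tokens").split(),
            "labels": el.get("anno_unit_labels").split(),
            "doc_id": el.get("doc_id"),
            "context": el.text or "",
        })
    return examples


def mean_seconds_per_token(examples):
    """Average annotation time in seconds per annotated token."""
    total_ms = sum(e["anno_time_ms"] for e in examples)
    total_tokens = sum(len(e["tokens"]) for e in examples)
    return total_ms / 1000.0 / total_tokens


examples = read_examples(SAMPLE)
print(len(examples), mean_seconds_per_token(examples))
```

A per-token time of this kind is one simple input to the annotation cost models mentioned above; the corpus documentation describes the measures actually used.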
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Information retrieval
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tomanek, Katrin
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u ben d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635618
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 651-762-041-881-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Indian Language Part-of-Speech Tagset: Bengali
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Indian Language Part-of-Speech Tagset: Bengali, Linguistic Data Consortium (LDC) catalog
number LDC2010T16 and isbn 1-58563-561-8, is a corpus developed by Microsoft Research
(MSR) India to support the task of Part-of-Speech Tagging (POS) and other data-driven
linguistic research on Indian Languages in general. It is created as a part of the
Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among
linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai),
Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University
(Tamilnadu). The goal of the IL-POST project is to provide a common tagset framework
for Indian Languages that offers flexibility, cross-linguistic compatibility and reusability
across those languages. It supports a three-level hierarchy of Categories, Types and
Attributes. The corpus therefore consists mainly of two levels of information
for each lexical token: (a) lexical Category and Types, and (b) a set of morphological
attributes and their associated values in the context. Bengali (also referred to as
Bangla) is a member of the Eastern Indo-Aryan language group. It is native to the
region of Bengal which consists of Bangladesh, the Indian state of West Bengal, and
parts of the Indian states of Tripura and Assam. It is spoken by more than 210 million
people as a first or a second language with around 100 million speakers in Bangladesh,
about 85 million speakers in India, and others in immigrant communities in the United
Kingdom, USA and the Middle East. *Data* This corpus contains 7168 sentences (102933
words) of manually annotated text from modern standard Bengali sources including blogs,
Wikipedia, Multikulti and a portion of the EMILLE/CIIL corpus. The annotated data
is structured into two folders, Bangla1 (3684 sentences, 51091 words) and Bangla2
(3484 sentences, 51842 words), which represent the two stages in which the data was
annotated. All annotated data is provided in both xml and text files. Each data file
contains between 3,000 and 5,000 words. The XML file contains metadata about the material,
such as language, encoding and data size. *Annotation Procedure* The Annotation Guidelines
for Bangla included in this release contain a detailed description of the annotation
methodology.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Bengali. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Bengali language
- Form subdivision:
Databases.
- General subdivision:
Parts of speech
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Bengali language
- Form subdivision:
Databases.
- General subdivision:
Word frequency
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bali, Kalika
ADDED ENTRY--PERSONAL NAME
- Personal name:
Choudhury, Monojit
ADDED ENTRY--PERSONAL NAME
- Personal name:
Biswas, Priyanka
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635626
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 714-470-511-952-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2006 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2006 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium
(LDC) catalog number LDC2010T17 and isbn 1-58563-562-6, is a package containing source
data, reference translations and scoring software used in the NIST 2006 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of broadcast, newswire and web newsgroup source data and reference translations
collected and developed by LDC. The objective of the NIST Open Machine Translation
(OpenMT) evaluation series is to support research in, and help advance the state of
the art of, machine translation (MT) technologies -- technologies that translate text
between human languages. Input may include all forms of text. The goal is for the
output to be an adequate and fluent translation of the original. The MT evaluation
series started in 2001 as part of the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been
driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important
contribution to the direction of research efforts and the calibration of technical
capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers
working on the general problem of automatic translation between human languages. To
this end, they are designed to be simple, to focus on core technology issues and to
be fully supported. The 2006 task was to evaluate translation from Arabic to English
and from Chinese to English. Additional information about these evaluations may be
found at the NIST Open Machine Translation (OpenMT) Evaluation web site. *Scoring
Tools* This evaluation kit includes a single Perl script (mteval-v11b.pl) that may
be used to produce a translation quality score for one (or more) MT systems. The script
works by comparing the system output translation with a set of (expert) reference
translations of the same source text. Comparison is based on finding sequences of
words in the reference translations that match word sequences in the system output
translation. More information on the evaluation algorithm may be obtained from the
paper detailing the algorithm: BLEU: a Method for Automatic Evaluation of Machine
Translation (Papineni et al, 2002). The included scoring script was released with
the original evaluation, intended for use with SGML-formatted data files, and is provided
to ensure compatibility of user scoring results with results from the original evaluation.
An updated scoring software package (mteval-v13a-20091001.tar.gz), with XML support,
additional options and bug fixes, documentation, and example translations, may be
downloaded from the NIST Multimodal Information Group Tools website. *Data* This
release consists of 357 documents with corresponding sets of four separate human expert
reference translations. The source data comprises Arabic and Chinese newswire
documents, human transcriptions of broadcast news and broadcast conversation programs
and web newsgroup documents collected by LDC in 2006. The newswire and broadcast material
are from Agence France-Presse (Arabic, Chinese), Xinhua News Agency (Arabic, Chinese),
Lebanese Broadcasting Corp. (Arabic), Dubai TV (Arabic), China Central TV (Chinese)
and New Tang Dynasty Television (Chinese). The web text was collected from Google
and Yahoo newsgroups. For each language, the test set consists of two files: a source
and a reference file. Each file contains four independent translations of the data
set. The evaluation year, source language, test set (which, by default, is evalset),
version of the data, and source vs. reference file (with the latter being indicated
by -ref) are reflected in the file name. A reference file contains four independent
reference translations unless noted otherwise in the accompanying README.txt. DARPA
TIDES and NIST OpenMT evaluations used SGML-formatted test data until 2008 and XML-formatted
test data thereafter. The files in this package are provided in both formats.
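The word-sequence matching described under Scoring Tools can be sketched in Python as the clipped n-gram precision defined in the BLEU paper. This is an illustration only, not a port of mteval-v11b.pl: it assumes plain whitespace tokenization, and it omits the brevity penalty and the geometric mean over n-gram orders that a full BLEU score requires. The function names and sample sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-token sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


refs = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_precision("the cat sat on the mat", refs[:1], 2))
print(clipped_precision("the the the the the the the", refs, 1))
```

Clipping is what keeps a degenerate output that repeats one common word from scoring well: in the second call only two of the seven occurrences of "the" are credited.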
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u kor d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635642
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 415-772-920-718-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kor
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Korean Newswire Second Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Korean Newswire Second Edition, Linguistic Data Consortium (LDC) catalog number LDC2010T19
and isbn 1-58563-564-2, is an archive of Korean newswire text that has been acquired
over several years (1994-2009) at LDC from the Korean Press Agency. This release includes
all of the content of Korean Newswire LDC2000T45 (June 1994-March 2000) as well as
newly-collected data. *New in the Second Edition* The second edition contains all
data collected by LDC from April 2000 through December 2009. All material, including
that from the first release, has been converted to UTF-8 (except for more recent data
already in UTF-8 format) and processed in LDC's gigaword format. The gigaword format
classifies newswire content into three types: story, multi and other, where story refers
to an article containing information pertaining to a particular event on a day; multi
refers to an article that contains more than one story relating to different topics;
and other refers to articles containing lists, tables or numerical data, such as sports
scores. A word break error in the original release and in data collected from January
2002 through February 2005 has been corrected in the second edition with the result
that all Korean text should display correctly. The error involved a line break in
the middle of a word with the result that an affected word appeared in segments in
two lines. This problem was resolved using word histograms and a few common rules
based on heuristics from the data and has yielded a 90% - 95% word break correction
rate. Further information about the word break correction procedure is available in
Word_Break_Correction_Procedure.txt. The following table shows for each gigaword classification,
the number of documents in the classification (# DOCS), the number of space-separated
word tokens in the text (K-WORDS) and the uncompressed file size in kilobytes (TextKB):
        # DOCS  K-WORDS  TextKB
story   217052    37546  371722
multi       31       21     239
other     7318     1034    8375
*Data* The directory structure of the corpus is as follows:
.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data
data: This directory contains the corpus files. Each
file contains data collected during the course of a month. For example, the filename
kpa_kor_199406 contains data collected in June 1994. Each document in a file has a
fixed sgml structure governed by a dtd. The SGML tagging is as follows: Consult the
dtd for more information regarding the sgml structure of a single article. Not all
articles have information in all the tag fields. The dtd mandates that every article
must have a DOC tag and a BODY tag. The HEADLINE, DATELINE and P tags are optional.
Within the units, tagging is kept to a minimum, typically consisting only of tags
to mark paragraph boundaries. The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag
is interpreted in the manner described below: yyyy = Year, mm = Month, dd = Day, nnnn
= Sequence Number. For all articles that share the same yyyymmdd docid string, the nnnn
substring ensures that the docid is unique in the corpus. docs: Contains corpus documentation.
dtd: Contains the dtd for the corpus.
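The docid convention described above can be unpacked with a short Python sketch. It assumes the sequence number is exactly four digits, as the nnnn pattern suggests, and that a document is stored in the monthly file matching the date in its docid; both are assumptions to check against the corpus documentation. The sample docid and the function names are hypothetical.

```python
import re
from datetime import date

# KPA_KOR_yyyymmdd.nnnn, with nnnn assumed to be exactly four digits.
DOCID_RE = re.compile(r"^KPA_KOR_(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_docid(docid):
    """Split a docid into its calendar date and sequence number."""
    match = DOCID_RE.match(docid)
    if match is None:
        raise ValueError(f"not a KPA_KOR docid: {docid!r}")
    year, month, day, seq = (int(group) for group in match.groups())
    return date(year, month, day), seq


def monthly_filename(docid):
    """Name of the monthly file assumed to hold the document."""
    doc_date, _ = parse_docid(docid)
    return f"kpa_kor_{doc_date.year:04d}{doc_date.month:02d}"


doc_date, seq = parse_docid("KPA_KOR_19940615.0001")
print(doc_date.isoformat(), seq, monthly_filename("KPA_KOR_19940615.0001"))
```

Building a real date object also rejects impossible dates, such as month 13, which gives a cheap sanity check when indexing the corpus.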
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Korean. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
- Geographic subdivision:
Korea (South)
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mendonça, Ângelo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cole, Andy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635634
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 719-248-950-383-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0, Linguistic Data Consortium
(LDC) catalog number LDC2010T18 and isbn 1-58563-563-4, was developed by researchers
at The MITRE Corporation. It contains the English evaluation data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by
the Automatic Content Extraction (ACE) program, specifically, English broadcast news
and newswire data collected by LDC. The training data for this evaluation can be found
in ACE Time Normalization (TERN) 2004 English Training Data v 1.0 LDC2005T07. The
purpose of the TERN evaluation is to advance the state of the art in the automatic
recognition and normalization of natural language temporal expressions. In most language
contexts such expressions are indexical. For example, with "Monday," "last week,"
or "three months starting October 1," one must know the narrative reference time in
order to pinpoint the time interval being conveyed by the expression. In addition,
for data exchange purposes, it is essential that the identified interval be rendered
according to an established standard, i.e., normalized. Accurate identification and
normalization of temporal expressions are in turn essential for the temporal reasoning
being demanded by advanced NLP applications such as question answering, information
extraction and summarization. *Data* The data in this release is English broadcast
transcripts and newswire material from TDT4 Multilingual Text and Annotations LDC2005T16.
The annotation specifications for this corpus were developed under DARPA's Translingual
Information Detection Extraction and Summarization (TIDES) program, with support from
ACE. All files have been doubly-annotated by two separate annotators and then reconciled,
using the TIDES 2003 Standard for the Annotation of Temporal Expressions (included
in this release). The table below illustrates the number of words and documents by
genre:
                Words  Documents
Broadcast news  26418        127
Newswire        28196         65
TOTAL           54614        192
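The dependence of expressions such as "Monday" and "last week" on a reference time can be shown with a small Python sketch. It does not implement the TIDES 2003 standard. It assumes that "Monday" means the most recent Monday on or before the reference date and that "last week" means the ISO week before the reference date, and it renders results as ISO 8601 strings. The function names and the reference date are hypothetical.

```python
from datetime import date, timedelta


def resolve_monday(reference):
    """Most recent Monday on or before the reference date (assumed reading)."""
    # date.weekday() returns 0 for Monday, so subtracting it lands on Monday.
    return reference - timedelta(days=reference.weekday())


def resolve_last_week(reference):
    """ISO 8601 week string for the week before the reference date."""
    previous = reference - timedelta(days=7)
    iso_year, iso_week, _ = previous.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


reference = date(2004, 4, 15)  # a Thursday
print(resolve_monday(reference).isoformat())
print(resolve_last_week(reference))
```

With the reference date above, the two calls print 2004-04-12 and 2004-W15; the same expressions yield different values under a different reference date, which is why the narrative reference time has to be known before normalizing.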
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Natural language processing (Computer science)
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ferro, Lisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gerber, Laurie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mani, Inderjeet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sundheim, Beth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wilson, George
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s1993 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635650
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC93S1W
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 916-193-657-488-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[1993]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC93S1W
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
This version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) has
all the waveform files formatted with MS-WAV/RIFF headers, to make the corpus more
accessible to a wider audience. The TIMIT corpus of read speech is designed to provide
speech data for acoustic-phonetic studies and for the development and evaluation of
automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers
of eight major dialects of American English, each reading ten phonetically rich sentences.
The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions
as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design
was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International
(SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed
at MIT and verified and prepared for CD-ROM production by the National Institute of
Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified.
Test and training subsets, balanced for phonetic and dialectal coverage, are specified.
Tabular computer-searchable information is included as well as written documentation.
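Because this version uses RIFF/WAV headers, the stated 16-bit, 16 kHz format can be checked with a standard WAV reader. The following is a minimal sketch using Python's standard `wave` module; the helper names are hypothetical, and the in-memory file is only a stand-in for a corpus file, since no corpus paths are given here.

```python
import io
import wave

EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
EXPECTED_SAMPLE_RATE_HZ = 16000   # 16 kHz


def matches_stated_format(wav_source):
    """Return True if the WAV header reports 16-bit samples at 16 kHz.

    `wav_source` may be a file path or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as wav:
        return (wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
                and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ)


def make_silent_wav(sample_rate_hz, sample_width_bytes, n_frames=160):
    """Build a small mono WAV file in memory (a stand-in for a corpus file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * sample_width_bytes * n_frames)
    buffer.seek(0)
    return buffer


print(matches_stated_format(make_silent_wav(16000, 2)))
print(matches_stated_format(make_silent_wav(8000, 2)))
```

To check a real file, pass its path to `matches_stated_format` in place of the in-memory buffer.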
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lamel, Lori F.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fisher, William M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pallett, David S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dahlgren, Nancy L.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zue, Victor
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC93S1W
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u urd d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635677
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 415-534-082-867-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2008 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2008 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium
(LDC) catalog number LDC2010T21 and ISBN 1-58563-567-7, is a package containing source
data, reference translations and scoring software used in the NIST 2008 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of broadcast, newswire and web data and reference translations collected
and developed by LDC. The objective of the NIST Open Machine Translation (OpenMT)
evaluation series is to support research in, and help advance the state of the art
of, machine translation (MT) technologies -- technologies that translate text between
human languages. Input may include all forms of text. The goal is for the output to
be an adequate and fluent translation of the original. The MT evaluation series started
in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization)
program. Beginning with the 2006 evaluation, the evaluations have been driven and
coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution
to the direction of research efforts and the calibration of technical capabilities
in MT. The OpenMT evaluations are intended to be of interest to all researchers working
on the general problem of automatic translation between human languages. To this end,
they are designed to be simple, to focus on core technology issues and to be fully
supported. The 2008 task was to evaluate translation from Arabic to English, Chinese
to English, English to Chinese (newswire only) and Urdu to English. Selected human
reference translations and system translations for the NIST MT08 test sets are contained
in NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System
Translations LDC2010T01. Additional information about these evaluations may be found
at the NIST Open Machine Translation (OpenMT) Evaluation web site. *Scoring Tools*
This evaluation kit includes a single Perl script (mteval-v11b.pl) that may be used
to produce a translation quality score for one (or more) MT systems. The script works
by comparing the system output translation with a set of (expert) reference translations
of the same source text. Comparison is based on finding sequences of words in the
reference translations that match word sequences in the system output translation.
More information on the evaluation algorithm may be obtained from the paper detailing
the algorithm: BLEU: a Method for Automatic Evaluation of Machine Translation (Papineni
et al., 2002). The included scoring script was released with the original evaluation,
intended for use with SGML-formatted data files, and is provided to ensure compatibility
of user scoring results with results from the original evaluation. An updated scoring
software package (mteval-v13a-20091001.tar.gz), with XML support, additional options
and bug fixes, documentation, and example translations, may be downloaded from the
NIST Multimodal Information Group Tools website. *Data* This release contains 494
documents with corresponding sets of four separate human expert reference translations.
The source data consists of Arabic, Chinese, English and Urdu newswire, broadcast,
weblog and newsgroup data collected by LDC in 2007. The newswire and broadcast
material are from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic, Chinese,
English), Al-Ahram (Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar (Arabic),
Al-Quds Al-Arabi (Arabic), Xinhua News Agency (Arabic, Chinese, English), Central
News Service (Chinese), Guangming Daily (Chinese), People's Daily (Chinese), People's
Liberation Army Daily (Chinese), British Broadcasting Corporation (Urdu), Daily Jang
(Urdu), Pakistan News Service (Urdu), Voice of America (Urdu), Associated Press (English),
New York Times (English) and Los Angeles Times/Washington Post Newswire Service (English).
For each language, the test set consists of two files: a source and a reference file.
Each reference file contains four independent translations of the data set. The evaluation year,
source language, test set (which, by default, is "evalset"), version of the data,
and source vs. reference file (with the latter being indicated by "-ref") are reflected
in the file name. A reference file contains four independent reference translations
unless noted otherwise in the accompanying README.txt. DARPA TIDES MT and NIST OpenMT
evaluations used SGML-formatted test data until 2008 and XML-formatted test data thereafter.
The files in this package are provided in both formats.
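The word-sequence matching described under Scoring Tools can be sketched as a clipped n-gram precision against several references, which is the core quantity in the BLEU paper. This is not the code of mteval-v11b.pl: whitespace tokenization and the function names are assumptions for the example, and the brevity penalty and the combination across n-gram orders are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams matched in the references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent (clipping).
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_precision("the the the the the the the", references, 1))
print(clipped_precision("the cat sat on the mat", references, 2))
```

Clipping is what stops a degenerate output that repeats a common word from scoring well: the first candidate above is credited for only two of its seven tokens.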
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Urdu, English, Mandarin Chinese, Arabic, and Chinese. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Urdu language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u urd d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635707
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 264-294-098-796-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2009 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2009 Open Machine Translation (OpenMT) Evaluation, Linguistic Data Consortium
(LDC) catalog number LDC2010T23 and ISBN 1-58563-570-7, is a package containing source
data, reference translations and scoring software used in the NIST 2009 OpenMT evaluation.
It is designed to help evaluate the effectiveness of machine translation systems.
The package was compiled and scoring software was developed by researchers at NIST,
making use of broadcast, newswire and web data and reference translations collected
and developed by LDC. The objective of the NIST Open Machine Translation (OpenMT)
evaluation series is to support research in, and help advance the state of the art
of, machine translation (MT) technologies -- technologies that translate text between
human languages. Input may include all forms of text. The goal is for the output to
be an adequate and fluent translation of the original. The MT evaluation series started
in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization)
program. Beginning with the 2006 evaluation, the evaluations have been driven and
coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution
to the direction of research efforts and the calibration of technical capabilities
in MT. The OpenMT evaluations are intended to be of interest to all researchers working
on the general problem of automatic translation between human languages. To this end,
they are designed to be simple, to focus on core technology issues and to be fully
supported. The 2009 task was to evaluate translation from Arabic to English and Urdu
to English. Additional information about these evaluations may be found at the NIST
Open Machine Translation (OpenMT) Evaluation web site. *Scoring Tools* This evaluation
kit includes a single Perl script (mteval-v11b.pl) that may be used to produce a translation
quality score for one (or more) MT systems. The script works by comparing the system
output translation with a set of (expert) reference translations of the same source
text. Comparison is based on finding sequences of words in the reference translations
that match word sequences in the system output translation. More information on the
evaluation algorithm may be obtained from the paper detailing the algorithm: BLEU:
a Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002). The
included scoring script is intended for use with SGML-formatted data files. An updated
scoring software package (mteval-v13a-20091001.tar.gz), with XML support, additional
options and bug fixes, documentation, and example translations, may be downloaded
from the NIST Multimodal Information Group Tools website. *Data* This release contains
373 documents with corresponding sets of four separate human expert reference translations.
The source data consists of Arabic and Urdu broadcast, newswire and weblog data
collected by LDC in 2007 and 2009. The newswire and broadcast material are from Asharq
Al-Awsat (Arabic), Agence France-Presse (Arabic), Al-Ahram (Arabic), Al Hayat (Arabic),
Assabah (Arabic), An Nahar (Arabic), Al-Quds Al-Arabi (Arabic), Xinhua News Agency
(Arabic), British Broadcasting Corporation (Urdu), Deutsche Welle (Urdu), Mehr News
Agency (Urdu) and Voice of America (Urdu). For each language, the test set consists
of two files: a source and a reference file. Each reference file contains four independent translations
of the data set. The evaluation year, source language, test set (which, by default,
is "evalset"), version of the data, and source vs. reference file (with the latter being
indicated by "-ref") are reflected in the file name. A reference file contains four
independent reference translations unless noted otherwise in the accompanying README.txt.
DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 2008
and XML-formatted test data thereafter. The files in this package are provided in
both formats.
LANGUAGE NOTE
- Language note:
Content in Urdu and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Urdu language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u hin d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635715
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 115-406-051-155-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hin
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Indian Language Part-of-Speech Tagset: Hindi
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Indian Language Part-of-Speech Tagset: Hindi, Linguistic Data Consortium (LDC) catalog
number LDC2010T24 and ISBN 1-58563-571-5, is a corpus developed by Microsoft Research
(MSR) India to support the task of part-of-speech (POS) tagging and other data-driven
linguistic research on Indian Languages in general. It was created as part of the
Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative effort among
linguists and computer scientists from MSR India, AU-KBC (Anna University, Chennai),
Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil University
(Tamilnadu). The goal of the IL-POST project is to provide a common tagset framework
for Indian Languages that offers flexibility, cross-linguistic compatibility and reusability
across those languages. It supports a three-level hierarchy of Categories, Types and
Attributes. The corpus therefore consists mainly of two levels of information
for each lexical token: (a) lexical Category and Type, and (b) a set of morphological
attributes and their associated values in context. Hindi is the official language
of India and a member of the Indo-Aryan language group. It is spoken mainly in the
northern states of Rajasthan, Delhi, Haryana, Uttarakhand, Uttar Pradesh, Madhya Pradesh,
Chhattisgarh, Himachal Pradesh, Jharkhand and Bihar as well as in much of central
India and in communities in Africa, Australia, New Zealand, the Middle East, Europe
and North America. Hindi is the first or second language of more than 500 million
people. *Data* This corpus contains 4,859 sentences (98,450 words) of manually annotated
Hindi text randomly collected from the Microsoft Hindi Research Corpus, sourced from
the publisher WebDunia. All annotated data is provided in both XML and text files.
The XML files are contained in the "XML_files" folder and the text files in the "text_files"
folder. Each data file contains between 900 and 5,000 words. The XML files contain metadata
about the material, such as language, encoding and data size. *Annotation Procedure*
The Annotation Guidelines for Hindi, included in this release, contain a detailed
description of the annotation methodology. The Annotation Tool Guideline 1.0, also
included in this publication, describes the annotation interface developed for the
IL-POST framework; the tool is not included in this corpus.
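The three-level hierarchy of Category, Type and Attributes can be pictured as a small record per token. The sketch below is only an illustration of that shape: the tag names, the attribute names and values, and the label format are all hypothetical and are not taken from the IL-POST tagset or the corpus files.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAnnotation:
    """One annotated token: Category > Type > Attributes."""
    token: str
    category: str                       # level 1, e.g. a lexical category
    type_: str                          # level 2, a subtype of the category
    attributes: dict = field(default_factory=dict)  # level 3, attribute -> value

    def label(self):
        """Flatten the three levels into one string (hypothetical format)."""
        base = f"{self.category}.{self.type_}"
        if not self.attributes:
            return base
        attrs = ";".join(f"{name}={value}"
                         for name, value in sorted(self.attributes.items()))
        return f"{base}[{attrs}]"


# Hypothetical tags for a Hindi common noun ("boy").
example = TokenAnnotation(
    token="लड़का",
    category="Noun",
    type_="Common",
    attributes={"number": "singular", "gender": "masculine"},
)
print(example.label())
```

Keeping the attributes in a separate mapping mirrors the description above, in which the Category and Type form one level of information and the morphological attributes with their values form the other.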
LANGUAGE NOTE
- Language note:
Content in Hindi. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bali, Kalika
ADDED ENTRY--PERSONAL NAME
- Personal name:
Choudhury, Monojit
ADDED ENTRY--PERSONAL NAME
- Personal name:
Biswas, Priyanka
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jha, Girish Nath
ADDED ENTRY--PERSONAL NAME
- Personal name:
Choudhary, Narayan Kumar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sharma, Maansi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2010 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635693
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2010T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 461-028-050-892-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Manually Annotated Sub-Corpus First Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2010]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2010T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium
(LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases
of 500,000 words of MASC data developed as part of the American National Corpus (ANC)
project. MASC I consists of approximately 80,000 words of contemporary spoken and
written American English annotated for a variety of linguistic phenomena. The MASC
project is sponsored by the National Science Foundation and was established to address,
to the extent possible, many of the obstacles to the creation of large-scale, robust,
multiply-annotated corpora of English covering a wide range of genres of written and
spoken language data. Researchers from Vassar College, Columbia University and the
International Computer Science Institute, University of California at Berkeley are
the principal participants; the WordNet project provides consulting. The source texts
in MASC I are drawn from the open portion of the American National Corpus (ANC) Second
Release LDC2005T35, which includes written texts and spoken transcripts of American
English from a broad range of genres produced since 1990 and from the Language Understanding
Annotation Corpus (LU Corpus) LDC2009T09, a collection of various genres including
broadcast, newswire, email and telephone speech annotated for committed belief, event
and entity coreference, dialog acts and temporal relations. All of the words of data
in MASC I have validated annotations for token, part of speech, sentence boundary,
noun chunks, verb chunks, named entities and Penn Treebank syntax. Full-text FrameNet
annotations are available for seventeen texts and WordNet word sense annotations are
available for 1000 occurrences of each of fifty-three words. Annotations of all or
portions of the sub-corpus for a wide variety of other linguistic phenomena have been
contributed by other projects. Software and services available from the ANC project
website enable transduction of MASC into a wide variety of physical formats. *Data*
The MASC directory contains two folders: masc-1.0.3 and masc_wordsense. masc-1.0.3
contains the actual MASC corpus and consists of two folders, spoken and written. The
spoken folder contains data and annotations for spoken material, and the written folder
contains the same for written texts. The files in each of the respective folders have
naming conventions that describe the contents of the file. masc_wordsense contains
the MASC sentence samples with word sense annotations using WordNet sense numbers
as the annotation values.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ide, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Suderman, Keith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Baker, Collin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Passonneau, Rebecca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fellbaum, Christiane
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2010T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635723
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 365-198-419-802-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages,
Linguistic Data Consortium (LDC) catalog number LDC2011T01 and ISBN 1-58563-572-3,
is a subset of OntoNotes Release 2.0 LDC2008T04 used in SemEval-2010 Task 1, Coreference
Resolution in Multiple Languages. OntoNotes Release 2.0 consists of roughly 500,000
words of English broadcast and newswire data annotated with structural information
(syntax and predicate argument structure) and shallow semantics (word sense linked
to an ontology and coreference). This SemEval-2010 Task 1 release contains approximately
120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational
semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and compare
automatic coreference resolution systems for six languages (Catalan, Dutch, English,
German, Italian and Spanish) in four evaluation settings using four metrics. Further
information about Task 1 can be found on the task description website. The task organizers
included researchers from Universitat de Barcelona (Spain), Universitat Politècnica
de Catalunya (Spain), University of Essex (United Kingdom), Università di Trento (Italy),
Hogeschool Gent (Belgium), University of Tübingen (Germany) and Stanford University
(USA). *Data* The data is divided into three sets: the development set (*/data/en.devel.txt)
which contains 39 documents, 741 sentences and 17,044 tokens; the training set (*/data/en.train.txt)
which contains 229 documents, 3,648 sentences and 79,060 tokens; and the test set
(*/data/en.test.txt) which contains 85 documents, 1,141 sentences and 24,206 tokens.
The complete material for training systems is the sum of the development and training
sets. Details of the SemEval task formatting applied to the data can be found in the
documentation file, en.info.txt. *Scorer* The official scorer is available from the
task download page.
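As a quick arithmetic check of the split sizes quoted in the summary above, the following Python sketch adds up the development and training sets, which the summary describes as the complete training material. The counts come from this record; the dictionary layout and the names SPLITS and combine are illustrative assumptions and are not part of the release.

```python
# Split sizes as quoted in the record: (documents, sentences, tokens).
SPLITS = {
    "devel": (39, 741, 17_044),
    "train": (229, 3_648, 79_060),
    "test": (85, 1_141, 24_206),
}


def combine(*names):
    """Sum documents, sentences and tokens across the named splits."""
    return tuple(sum(SPLITS[name][i] for name in names) for i in range(3))


# Development plus training is the complete material for training systems.
training_material = combine("devel", "train")
all_data = combine("devel", "train", "test")

print(training_material)  # (268, 4389, 96104)
print(all_data)           # (353, 5530, 120310)
```

The overall token total of 120,310 agrees with the approximately 120,000 words stated for the release.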
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Recasens, Marta
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marquez, Lluis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sapena, Emili
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martí, M. Antònia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taulé, Mariona
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635731
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 912-956-774-503-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2005 English SpatialML Annotations Version 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE 2005 English SpatialML Annotations Version 2, Linguistic Data Consortium (LDC)
catalog number LDC2011T02 and ISBN 1-58563-573-1, was developed by researchers at
The MITRE Corporation and applies SpatialML tags to the English newswire and broadcast
training data annotated for entities, relations and events in ACE 2005 Multilingual
Training Corpus LDC2006T06. This second version eliminates a number of annotation
inconsistencies and errors identified in ACE 2005 English SpatialML Annotations LDC2008T03.
In addition, the SpatialML annotation schema has been updated from version 2.0 to
version 3.0.1; the revised annotation guidelines are included in this release. The
ACE (Automatic Content Extraction) program focused on developing automatic content
extraction technology to support automatic processing of human language in text form,
specifically entities, values, temporal expressions, relations and events. SpatialML
is a mark-up language for representing spatial expressions in natural language documents.
It is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML,
and the 2005 ACE guidelines. SpatialML includes syntax for marking up PLACEs mentioned
in text and for linking them to data from gazetteers and other databases. LINKs are
used to express relations between places, and RLINKs to capture trajectories for relative
locations. To the extent possible, SpatialML leverages ISO and other standards with
the goal of making the scheme compatible with existing and future corpora. SpatialML
goes beyond these schemes, however, in terms of providing a richer markup for natural
language that includes semantic features and relationships that allow mapping to existing
resources such as gazetteers. Such markup can be useful for disambiguation, integration
with mapping services and spatial reasoning. *Data* This corpus contains 210,065 total
words and 17,821 unique words. Counts of unique words can be found in doc/ldc_wordcount.csv,
which includes all words that are not part of XML markup (e.g., without tag names,
attribute names or values). Unique words are counted by comparing case-insensitive
forms with leading and trailing punctuation stripped off. Words consisting
solely of punctuation are discarded. The principal change in the annotation schema
is that PATH has been generalized to RLINK for relative link. At the top level, there
is now a version attribute on the root SpatialML tag to capture which version of SpatialML
was used. A number of smaller changes have been made to the annotation specification;
these are listed in Section 2 of the updated guidelines. The files are provided in
both in-line XML format and aif format. The gaz-deref files contain multiple gazetteer
references when they exist for a single location; these different gazrefs sometimes
correspond to slightly different latlongs. The sgm.dtd validated files do not contain
document structure tags that would prevent them from being validated
with the SpatialML DTD. These files total 22,624,650 bytes uncompressed.
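The word-counting rule described in the summary above (ignore XML markup, fold case, strip leading and trailing punctuation, discard punctuation-only tokens) can be sketched in Python as follows. This is a minimal illustration and not the script LDC used: the whitespace tokenization, the regular expression used to drop tags, the use of string.punctuation as the punctuation set, and the function name count_words are all assumptions. The sample string is made up for the example.

```python
import re
import string


def count_words(xml_text):
    """Return (total_words, unique_words) for text with in-line XML markup."""
    # Assumption: removing everything between '<' and '>' drops tag names,
    # attribute names and attribute values.
    plain = re.sub(r"<[^>]*>", " ", xml_text)
    total = 0
    unique = set()
    for token in plain.split():  # assumption: whitespace tokenization
        # Strip leading and trailing punctuation, then fold case.
        word = token.strip(string.punctuation).lower()
        if not word:
            # The token consisted solely of punctuation, so it is discarded.
            continue
        total += 1
        unique.add(word)
    return total, len(unique)


# Made-up sample with one in-line tag.
sample = '<PLACE id="1" type="PPL">Boston</PLACE> is north of ( boston, "New York" ) .'
print(count_words(sample))  # (7, 6)
```

In the sample, "Boston" and "boston," collapse to one unique word, and the three punctuation-only tokens are not counted at all.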
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mani, Inderjeet
ADDED ENTRY--PERSONAL NAME
- Personal name:
Clancy, Seamus
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hitzeman, Janet
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 272-858-321-100-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OntoNotes Release 4.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number LDC2011T03
and ISBN 1-58563-574-X, was developed as part of the OntoNotes project, a collaborative
effort between BBN Technologies, the University of Colorado, the University of Pennsylvania
and the University of Southern California's Information Sciences Institute. The goal
of the project is to annotate a large corpus comprising various genres of text (news,
conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows)
in three languages (English, Chinese, and Arabic) with structural information (syntax
and predicate argument structure) and shallow semantics (word sense linked to an ontology
and coreference). OntoNotes Release 4.0 is supported by the Defense Advanced Research
Projects Agency, GALE Program Contract No. HR0011-06-C-0022. OntoNotes Release 4.0
contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes
Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire,
broadcast news, broadcast conversation and web data in English and Chinese and newswire
data in Arabic. This cumulative publication consists of 2.4 million words as follows:
300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese
broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese
web text; and 600k words of English newswire, 200k words of English broadcast news,
200k words of English broadcast conversation and 300k words of English web text. The
OntoNotes project builds on two time-tested resources, following the Penn Treebank
for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation
will include word sense disambiguation for nouns and verbs, with each word sense connected
to an ontology, and coreference. The current goals call for annotation of over a million
words each of English and Chinese, and half a million words of Arabic over five years.
*Data* Documents describing the annotation guidelines and the routines for deriving
various views of the data from the database are included in the documentation directory
of this release. The annotation is provided both in separate text files for each annotation
layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational
database (ontonotes-v4.0.sql.gz) with a Python API to provide convenient cross-layer
access. *Tools* This release includes OntoNotes DB Tool v0.999 beta, the tool used
to assemble the database from the original annotation files. It can be found in the
directory ontonotes-db-tool-v0.999b. This tool can be used to derive various views
of the data from the database, and it provides an API that can implement new queries
or views. Licensing information for the OntoNotes DB Tool package is included in its
source directory.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hovy, Eduard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaufman, Jeff
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franchini, Michelle
ADDED ENTRY--PERSONAL NAME
- Personal name:
El-Bachouti, Mohammed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Belvin, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Houston, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u san d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635758
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 837-079-335-566-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
san
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
san
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Indian Language Part-of-Speech Tagset: Sanskrit
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Indian Language Part-of-Speech Tagset: Sanskrit, Linguistic Data Consortium (LDC)
catalog number LDC2011T04 and ISBN 1-58563-575-8, is a corpus developed by Microsoft
Research (MSR) India to support the task of Part-of-Speech (POS) Tagging and other
data-driven linguistic research on Indian Languages in general. It was created as a
part of the Indian Language Part-of-Speech Tagset (IL-POST) project, a collaborative
effort among linguists and computer scientists from MSR India, AU-KBC (Anna University,
Chennai), Delhi University, IIT Bombay, Jawaharlal Nehru University (Delhi) and Tamil
University (Tamilnadu). The goal of the IL-POST project is to provide a common tagset
framework for Indian Languages that offers flexibility, cross-linguistic compatibility
and reusability across those languages. It supports a three-level hierarchy of Categories,
Types and Attributes. The corpus therefore mainly consists of two different levels
of information for each lexical token: (a) lexical Category and Types, and (b) a set
of morphological attributes and their associated values in the context. Sanskrit is the
classical language of India and the oldest documented language of the Indo-European
language family. It is also the liturgical language of Hinduism, Buddhism and Jainism
and one of the twenty-two official languages of India. The name Sanskrit means refined,
consecrated and sanctified. *Data* This corpus contains 3,703 sentences (57,218 words)
of manually annotated Sanskrit text selected from the Panchatantra stories, a collection
of animal fables in verse and prose dating from the third century BCE. All annotated
data is provided in both XML and text files. The XML files are contained in the XML_files
folder and the text files in the text_files folder. Each data file contains between
12,000 and 45,000 words. The XML file contains metadata about the material, such as language,
encoding and data size. *Annotation Procedure* The paper, Annotating Sanskrit corpus:
adapting IL-POSTS, included in this release, contains a detailed description of the
annotation methodology.
LANGUAGE NOTE
- Language note:
Content in Sanskrit. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jha, Girish Nath
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gopal, Madhav
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mishra, Diwakar
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635758
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 111-667-828-386-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set,
Linguistic Data Consortium (LDC) catalog number LDC2011T05 and ISBN 1-58563-575-8,
is a package containing source data, reference translations, machine translations
and associated human judgments used in the NIST 2008 and 2010 MetricsMaTr evaluations.
The package was compiled by researchers at NIST, making use of Arabic and Chinese
broadcast, newswire and web data and reference translations collected and developed
by LDC for Phase 2 and Phase 2.5 of the DARPA GALE program. NIST MetricsMaTr is a
series of research challenge events for machine translation (MT) metrology, promoting
the development of innovative MT metrics that correlate highly with human assessments
of MT quality. Participants submit their metrics to NIST (National Institute of Standards
and Technology). NIST runs those metrics on certain held-back test data for which
it has human assessments measuring quality and then calculates correlations between
the automatic metric scores and the human assessments. Specifically, the goals of
MetricsMATR are: to inform other MT technology evaluation campaigns and conferences
with regard to improved metrology; to establish an infrastructure that encourages the
development of innovative metrics; to build a diverse community that will bring new
perspectives to MT metrology research; and to provide a forum for MT metrology discussion
and for establishing future directions of MT metrology. The first MetricsMaTr challenge
was held in 2008; the development data from the 2008 program is available from LDC as
2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data LDC2009T05.
The MetricsMaTr10 evaluation plan is included in this release. *Data* This release
contains 149 documents with corresponding reference translations (Arabic-to-English
and Chinese-to-English), system translations and human assessments. The human assessments
include the following: Adequacy7 (a 7-point scale for judging the meaning of a system
translation with respect to the reference translation); Adequacy Yes/No (whether the
given system segment meant essentially the same as the reference translation); Preference
(the judge's preference between two candidate translations when compared to a human
reference translation); and HTER (Human-targeted Translation Edit Rate, based on human
edits that give a system translation the same meaning as a reference translation).
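The summary above says that NIST calculates correlations between automatic metric scores and human assessments but does not say which correlation statistics are used. As one possible illustration, the Python sketch below computes a Pearson correlation coefficient. The function name pearson, the choice of Pearson over other statistics, and the score values are all assumptions made for the example; the values are made up and are not taken from the corpus.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence has zero variance and would divide by zero here.
    return cov / sqrt(var_x * var_y)


# Made-up values for five segments: an automatic metric score per segment
# and a human judgment on a 7-point adequacy scale.
metric_scores = [0.10, 0.35, 0.40, 0.70, 0.90]
adequacy7 = [1, 3, 4, 5, 7]

r = pearson(metric_scores, adequacy7)
print(round(r, 3))
```

A value of r close to 1 would indicate that the metric ranks and spaces the segments much as the human judges do.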
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Newspapers
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635766
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 921-255-128-642-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part
1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part
1, Linguistic Data Consortium (LDC) catalog number LDC2011V01 and ISBN 1-58563-576-6,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
fifteen hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and annotated for the VACE (Video Analysis and Content
Extraction) 2005 face, person and hand detection and tracking tasks. The VACE program
was established to develop novel algorithms for automatic video content extraction,
multi-modal fusion, and event understanding. During VACE Phases I and II, the program
made significant progress in the automated detection and tracking of moving objects
including faces, hands, people, vehicles and text in four primary video domains: broadcast
news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial
results were also obtained on automatic analysis of human activities and understanding
of video sequences. Three performance evaluations were conducted under the auspices
of the VACE program between 2004 and 2007. The 2005 evaluation was administered by
USF in collaboration with NIST and guided by an advisory forum including the evaluation
participants. A summary of results of the evaluation can be found in the 2005 VACE
results and analysis paper included in this release. *Data* NIST's Meeting Data Collection
Laboratory is designed to collect corpora to support research, development and evaluation
in meeting recognition technologies. It is equipped to look and sound like a conventional
meeting space. The data collection facility includes five Sony EVI-D30 video cameras,
four of which have stationary views of a center conference table (one view from each
surrounding wall) with a fixed focus and viewing angle, and an additional floating
camera which is used to focus on particular participants, the whiteboard or the conference
table depending on the meeting forum. The data is captured in a NIST-internal file
format. The video data was extracted from the NIST format and encoded using the MPEG-2
standard in NTSC format. Further information concerning the video data parameters
can be found in the documentation included with this corpus. *Tools* The VACE evaluation
tools have been integrated into NIST's downloadable Framework for Detection Evaluation
(F4DE) Toolkit. The toolkit contains small example files for each of the task/object/domain
scoring combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Human face recognition (Computer science)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Optical pattern recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635774
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 901-755-963-423-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part
2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part
2, Linguistic Data Consortium (LDC) catalog number LDC2011V02 and ISBN 1-58563-577-4,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
fourteen hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and annotated for the VACE (Video Analysis and Content
Extraction) 2005 face, person and hand detection and tracking tasks. LDC has previously
released NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training
Set Part 1 LDC2011V01. The VACE program was established to develop novel algorithms
for automatic video content extraction, multi-modal fusion, and event understanding.
During VACE Phases I and II, the program made significant progress in the automated
detection and tracking of moving objects including faces, hands, people, vehicles
and text in four primary video domains: broadcast news, meetings, street surveillance,
and unmanned aerial vehicle motion imagery. Initial results were also obtained on
automatic analysis of human activities and understanding of video sequences. Three
performance evaluations were conducted under the auspices of the VACE program between
2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST
and guided by an advisory forum including the evaluation participants. A summary of
results of the evaluation can be found in the 2005 VACE results and analysis paper
included in this release. *Data* NIST's Meeting Data Collection Laboratory is designed
to collect corpora to support research, development and evaluation in meeting recognition
technologies. It is equipped to look and sound like a conventional meeting space.
The data collection facility includes five Sony EV1-D30 video cameras, four of which
have stationary views of a center conference table (one view from each surrounding
wall) with a fixed focus and viewing angle, and an additional floating camera which
is used to focus on particular participants, whiteboard or conference table depending
on the meeting forum. The data is captured in a NIST-internal file format. The video
data was extracted from the NIST format and encoded using the MPEG-2 standard in NTSC
format. Further information concerning the video data parameters can be found in the
documentation included with this corpus. *Tools* The VACE evaluation tools have been
integrated into NIST's downloadable Framework for Detection Evaluation (F4DE) Toolkit.
The toolkit contains small example files for each of the task/object/domain scoring
combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Human face recognition (Computer science)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Optical pattern recognition
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635782
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 990-903-200-829-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Broadcast News Lattices
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Broadcast News Lattices, Linguistic Data Consortium (LDC) catalog number LDC2011T06
and ISBN 1-58563-578-2, was developed by researchers at Microsoft and Johns Hopkins
University (JHU) for the Johns Hopkins 2010 Summer Workshop on Speech Recognition
with Conditional Random Fields. The lattices were generated using the IBM Attila speech
recognition toolkit and were derived from transcripts of approximately 400 hours of
English broadcast news recordings. They are intended to be used for training and decoding
with Microsoft's segmental CRF toolkit for speech recognition, SCARF. The goal of the
JHU 2010 workshop was to advance the state-of-the-art in core speech recognition by
developing new kinds of features for use in a Segmental Conditional Random Field (SCRF).
The SCRF approach generalizes Conditional Random Fields to operate at the segment level,
rather than at the traditional frame level. Every segment is labeled directly with
a word. Features are then extracted which each measure some form of consistency between
the underlying audio and the word hypothesis for a segment. These are combined in
a log-linear model (lattice) to produce the posterior probability of a word sequence
given the audio. *Data* Broadcast News Lattices consists of training and test material,
the source data for which was taken from various corpora distributed by LDC.
Training Data: The training lattices total 152251 and were derived from the following data sets:
* 1996 English Broadcast News Speech LDC97S44 and 1996 English Broadcast News Transcripts
(HUB4) LDC97T22 (104 hours)
* 1997 English Broadcast News Speech (HUB4) LDC98S71 and 1997 English Broadcast News
Transcripts (HUB4) LDC98T28 (97 hours)
* TDT4 Multilingual Broadcast News Speech Corpus LDC2005S11 and TDT4 Multilingual Text
and Annotations LDC2005T16 (300 hours)
The lattices can be related to the original audio files via the file train.db.gz
which lists for each segment a tag-name, segment number, the original audio file,
channel (always 0), start time, and end time (in seconds). A sample line is as follows:
19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404
This sample line corresponds to the release lattice labeled: 19960510_NPR_ATC#Ailene_Leblanc@0001.dc
The file train.Bdc contains denominator lattices. The file train.Bnc has the numerator
lattices containing the subset of paths consistent with the training transcriptions.
The file train.Btr consists of the transcriptions. The file train.Bbase contains the
baseline (one-best) word detections from the Attila system. The lattices were generated
from an acoustic model that included LDA+MLLT, VTLN, fMLLR based SAT training, fMMI
and mMMI discriminative training, and MLLR. The lattices are annotated with a field
indicating the results of a second confirmatory decoding made with an independent
speech recognizer. When there was a correspondence between a lattice link and the
1-best secondary output, the link was annotated with +1. Silence links are annotated
with 0 and all others with -1. Correspondence was computed by finding the midpoint
of a lattice link and comparing the link label with that of the word in the secondary
decoding at that position. Thus, there are some cases where the same word shifted
slightly in time receives a different confirmation score. Test Data The test lattices
are derived from the English broadcast news material in 2003 NIST Rich Transcription
Evaluation Data LDC2007S10. Bbase and Bdc files are provided, along with the db file
rt03.db.gz to link the segments to times in the original waveform. Scoring scripts
may be obtained from the NIST Rich Transcription website. *SCARF Toolkit* The SCARF
toolkit is available for download from the SCARF website. *Related Publications* A
full description of the lattice generation process can be found in Zweig et al., Speech
Recognition with Segmental Conditional Random Fields: Final Report from the 2010 JHU
Summer Workshop, MSR Technical Report MSR-TR-2010-173.
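The segment listing described above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: the names Segment, parse_db_line and lattice_label are hypothetical, the field order follows the description of train.db.gz, and the tag@number.dc naming rule is inferred from the single sample line quoted in the summary, so it should be checked against the corpus documentation.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One line of the segment database (field order as described in the summary)."""
    tag: str
    number: str       # kept as a string so the leading zeros survive
    audio_file: str
    channel: int
    start: float      # seconds
    end: float        # seconds


def parse_db_line(line: str) -> Segment:
    """Split a whitespace-separated db line into its six documented fields."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    tag, number, audio_file, channel, start, end = fields[:6]
    return Segment(tag, number, audio_file, int(channel), float(start), float(end))


def lattice_label(seg: Segment, suffix: str = ".dc") -> str:
    """Build the lattice label; the tag@number pattern is assumed from the sample."""
    return f"{seg.tag}@{seg.number}{suffix}"


sample = "19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404"
seg = parse_db_line(sample)
print(lattice_label(seg))
print(f"{seg.end - seg.start:.3f} seconds of audio in {seg.audio_file}")
```

Keeping the segment number as a string avoids losing the zero padding that appears in the lattice label.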
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zweig, Geoffrey
ADDED ENTRY--PERSONAL NAME
- Personal name:
Karakos, Damianos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nguyen, Patrick
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635790
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 330-650-936-529-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1,
Linguistic Data Consortium (LDC) catalog number LDC2011V03 and ISBN 1-58563-579-0,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
eleven hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and used in the VACE (Video Analysis and Content Extraction)
2005 evaluation. The VACE program was established to develop novel algorithms for
automatic video content extraction, multi-modal fusion, and event understanding. During
VACE Phases I and II, the program made significant progress in the automated detection
and tracking of moving objects including faces, hands, people, vehicles and text in
four primary video domains: broadcast news, meetings, street surveillance, and unmanned
aerial vehicle motion imagery. Initial results were also obtained on automatic analysis
of human activities and understanding of video sequences. Three performance evaluations
were conducted under the auspices of the VACE program between 2004 and 2007. The 2005
evaluation was administered by USF in collaboration with NIST and guided by an advisory
forum including the evaluation participants. A summary of results of the evaluation
can be found in the 2005 VACE results and analysis paper included in this release.
*Data* NIST's Meeting Data Collection Laboratory is designed to collect corpora to
support research, development and evaluation in meeting recognition technologies.
It is equipped to look and sound like a conventional meeting space. The data collection
facility includes five Sony EV1-D30 video cameras, four of which have stationary views
of a center conference table (one view from each surrounding wall) with a fixed focus
and viewing angle, and an additional floating camera which is used to focus on particular
participants, whiteboard or conference table depending on the meeting forum. The data
is captured in a NIST-internal file format. The video data was extracted from the
NIST format and encoded using the MPEG-2 standard in NTSC format. Further information
concerning the video data parameters can be found in the documentation included with
this corpus. *Tools* The VACE evaluation tools have been integrated into NIST's downloadable
Framework for Detection Evaluation (F4DE) Toolkit. The toolkit contains small example
files for each of the task/object/domain scoring combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635804
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 778-313-260-404-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2005 NIST Speaker Recognition Evaluation Training Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2005 NIST Speaker Recognition Evaluation Training Data, Linguistic Data Consortium
(LDC) catalog number LDC2011S01 and ISBN 1-58563-580-4, was developed at LDC and NIST
(National Institute of Standards and Technology). It consists of 392 hours of conversational
telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated
English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition
Evaluation (SRE). The ongoing series of SRE yearly evaluations conducted by NIST are
intended to be of interest to researchers working on the general problem of text-independent
speaker recognition. To that end the evaluations are designed to be simple, to focus
on core technology issues, to be fully supported and to be accessible to those wishing
to participate. The task of the 2005 SRE evaluation was speaker detection, that is,
to determine whether a specified speaker is speaking during a given segment of conversational
speech. The task was divided into 20 distinct and separate tests involving one of
five training conditions and one of four test conditions. Further information about
the task conditions is contained in The NIST Year 2005 Speaker Recognition Evaluation
Plan. *Data* The speech data consists of conversational telephone speech with multi-channel
data collected simultaneously from a number of auxiliary microphones. The files are
organized into two segments: 10-second two-channel excerpts (continuous segments from
single conversations that are estimated to contain approximately 10 seconds of actual
speech in the channel of interest) and 5-minute two-channel conversations. The speech
files are stored as 8-bit u-law speech signals in separate SPHERE files. In addition
to the standard header fields, the SPHERE header for each file contains some auxiliary
information that includes the language of the conversation and whether the data was
recorded over a telephone line. English language word transcripts in .cmt format were
produced using an automatic speech recognition system (ASR) with error rates in the
range of 15-30%.
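The 8-bit u-law storage mentioned above can be expanded to 16-bit linear samples for analysis. The following is a minimal sketch of the standard G.711 u-law expansion in Python. It assumes the raw sample bytes have already been separated from the SPHERE header (header parsing and channel de-interleaving are not shown), and the function names are illustrative rather than part of any LDC or NIST tool.

```python
from typing import List

BIAS = 0x84  # constant added during G.711 u-law encoding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit u-law code to a signed linear sample."""
    u = ~byte & 0xFF              # u-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> List[int]:
    """Decode a run of u-law bytes into linear samples."""
    return [ulaw_to_linear(b) for b in data]


# 0xFF encodes silence; 0x80 and 0x00 are the positive and negative extremes.
print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```

For two-channel files the decoded samples would still need to be split per channel according to the layout given in the SPHERE header.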
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish, Russian, English, Mandarin Chinese, and Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech processing systems.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635812
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 911-942-430-413-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Gigaword Fifth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T07
and ISBN 1-58563-581-2, is a comprehensive archive of newswire text data that has
been acquired over several years by the LDC at the University of Pennsylvania. The
fifth edition includes all of the contents in English Gigaword Fourth Edition (LDC2009T13)
plus new data covering the 24-month period of January 2009 through December 2010.
The seven distinct international sources of English newswire included in this edition
are the following:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* Washington Post/Bloomberg Newswire Service (wpb_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)
The seven-character codes in the parentheses above include the three-character source
name abbreviations and the three-character language code (eng) separated by an underscore
(_) character. The three-letter language code conforms to LDC's internal convention
based on the ISO 639-3
standard. *Data* The following table sets forth the overall totals for each source.
Note that Totl-MB refers to the quantity of data when unzipped (approximately 26
gigabytes), Gzip-MB refers to compressed file sizes as stored on the DVD-ROMs and
K-wrds refers to the number of whitespace-separated tokens (of all types) after all
SGML tags are eliminated:

Source   #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
afp_eng     146     1732     4937   738322  2479624
apw_eng     193     2700     7889  1186955  3107777
cna_eng     144       86      261    38491   145317
ltw_eng     127      651     1694   268088   411032
nyt_eng     197     3280     8938  1422670  1962178
wpb_eng      12       42      111    17462    26143
xin_eng     191      834     2518   360714  1744025
TOTAL      1010     9325    26348  4032686  9876086

*Sponsorship* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or policy of the Government, and no official
endorsement should be inferred.
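The source-code convention described above (three-character source abbreviation, underscore, three-character language code) is easy to check programmatically. This is a small illustrative sketch; the function name split_source_code is hypothetical and the list of codes is copied from the summary.

```python
from typing import Tuple


def split_source_code(code: str) -> Tuple[str, str]:
    """Split a code such as 'afp_eng' into (source, language)."""
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a source_language code: {code!r}")
    return source, language


# Codes as listed in the summary of this record.
CODES = ["afp_eng", "apw_eng", "cna_eng", "ltw_eng", "wpb_eng", "nyt_eng", "xin_eng"]

for code in CODES:
    source, language = split_source_code(code)
    print(f"{code}: source={source} language={language}")
```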
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635820
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 494-554-511-556-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Datasets for Generic Relation Extraction (reACE)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Datasets for Generic Relation Extraction (reACE) was developed at The University of
Edinburgh, Edinburgh, Scotland. It consists of English broadcast news and newswire
data originally annotated for the ACE (Automatic Content Extraction) program to which
the Edinburgh Regularized ACE (reACE) mark-up has been applied. The Edinburgh relation
extraction (RE) task aims to identify useful information in text (e.g., PersonW works
for OrganisationX, GeneY encodes ProteinZ) and to recode it in a format such as a
relational database or RDF triple store (a database for the storage and retrieval
of Resource Description Framework (RDF) metadata) that can be more effectively used
for querying and automated reasoning. A number of resources have been developed for
training and evaluation of automatic systems for RE in different domains. However,
comparative evaluation is impeded by the fact that these corpora use different markup
formats and different notions of what constitutes a relation. reACE solves this problem
by converting data to a common document type using token standoff and including detailed
linguistic markup while maintaining all information in the original annotation. The
subsequent reannotation process normalises the two data sets so that they comply with
a notion of relation that is intuitive, simple and informed by the semantic web. The
data in this corpus consists of newswire and broadcast news material from ACE 2004
Multilingual Training Corpus LDC2005T09 and ACE 2005 Multilingual Training Corpus
LDC2006T06. This material has been standardised for evaluation of multi-type RE across
domains. Complete documentation for this corpus is available at the publication provider's
web site Datasets for Generic Relation Extraction. *Data* Annotation includes (1)
a refactored version of the original data to a common XML document type (2) linguistic
information from LT-TTT (a system for tokenizing text and adding markup) and MINIPAR
(an English parser) and (3) a normalised version of the original RE markup that complies
with a shared notion of what constitutes a relation across domains. The data sources
represented in the corpus were collected by LDC in 2000 and 2003 and consist of the
following: ABC, Agence France Presse, Associated Press, Cable News Network, MSNBC/NBC,
New York Times, Public Radio International, Voice of America and Xinhua News Agency.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content analysis (Communication)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hachey, Benjamin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grover, Claire
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tobin, Richard
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635839
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 560-424-742-579-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST Spoken Term Detection Development Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST Spoken Term Detection Development Set, Linguistic Data Consortium (LDC)
catalog number LDC2011S02 and isbn 1-58563-583-9, was compiled by researchers at NIST
(National Institute of Standards and Technology) and contains approximately eighteen
hours of Arabic, Chinese and English broadcast news, English conversational telephone
speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD)
evaluation. The STD initiative is designed to facilitate research and development
of technology for retrieving information from archives of speech data with the goals
of exploring promising new ideas in spoken term detection, developing advanced technology
incorporating these ideas, measuring the performance of this technology and establishing
a community for the exchange of research results and technical insights. The 2006
STD task was to find all of the occurrences of a specified term (a sequence of one
or more words) in a given corpus of speech data. The evaluation was intended to develop
technology for rapidly searching very large quantities of audio data. Although the
evaluation used modest amounts of data, it was structured to simulate the very large
data situation and to make it possible to extrapolate the speed measurements to much
larger data sets. Therefore, systems were implemented in two phases: indexing and
searching. In the indexing phase, the system processes the speech data without knowledge
of the terms. In the searching phase, the system uses the terms, the index, and optionally
the audio to detect term occurrences. *Data* The development corpus consists of three
data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference
room meetings (CONFMTG). The broadcast news material was collected in 2001 by LDC's
broadcast collection system from the following sources: ABC (English), China Broadcasting
System (Chinese), China Central TV (Chinese), China National Radio (Chinese), China
Television System (Chinese), CNN (English), MSNBC/NBC (English), Nile TV (Arabic),
Public Radio International (English) and Voice of America (Arabic, Chinese, English).
The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1
LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English
Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting
material consists of goal-oriented, small group roundtable meetings and was collected
in 2001, 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley,
California), Carnegie Mellon University (Pittsburgh, PA) and Virginia Polytechnic
Institute and State University (Blacksburg, VA) as part of the AMI corpus project.
Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS
recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG
files contain a single recorded channel.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635847
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 244-296-223-213-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST Spoken Term Detection Evaluation Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST Spoken Term Detection Evaluation Set, Linguistic Data Consortium (LDC) catalog
number LDC2011S03 and isbn 1-58563-584-7, was compiled by researchers at NIST (National
Institute of Standards and Technology) and contains approximately eighteen hours of
Arabic, Chinese and English broadcast news, English conversational telephone speech
and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation.
The STD initiative is designed to facilitate research and development of technology
for retrieving information from archives of speech data with the goals of exploring
promising new ideas in spoken term detection, developing advanced technology incorporating
these ideas, measuring the performance of this technology and establishing a community
for the exchange of research results and technical insights. The 2006 STD task was
to find all of the occurrences of a specified term (a sequence of one or more words)
in a given corpus of speech data. The evaluation was intended to develop technology
for rapidly searching very large quantities of audio data. Although the evaluation
used modest amounts of data, it was structured to simulate the very large data situation
and to make it possible to extrapolate the speed measurements to much larger data
sets. Therefore, systems were implemented in two phases: indexing and searching. In
the indexing phase, the system processes the speech data without knowledge of the
terms. In the searching phase, the system uses the terms, the index, and optionally
the audio to detect term occurrences. The development data is available in 2006 NIST
Spoken Term Detection Development Set LDC2011S02. *Data* The evaluation corpus consists
of three data genres: broadcast news (BNews), conversational telephone speech (CTS)
and conference room meetings (CONFMTG). The broadcast news material was collected
in 2003 and 2004 by LDC's broadcast collection system from the following sources: ABC
(English), Aljazeera (Arabic), China Central TV (Chinese), CNN (English), CNBC (English),
Dubai TV (Arabic), New Tang Dynasty TV (Chinese), Public Radio International (English)
and Radio Free Asia (Chinese). The CTS data was taken from the Switchboard data sets
(e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher
corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by
LDC. The conference room meeting material consists of goal-oriented, small group roundtable
meetings and was collected in 2004 and 2005 by NIST, the International Computer Science
Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO
(The Netherlands) and Virginia Polytechnic Institute and State University (Blacksburg,
VA) as part of the AMI corpus project. This evaluation corpus includes scoring software.
It uses the inputs described in the STD Evaluation plan to complete the evaluation
of a system. Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted
file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files.
The CONFMTG files contain a single recorded channel.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, and Arabic. Documentation in
English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635855
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 881-155-690-205-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2,
Linguistic Data Consortium (LDC) catalog number LDC2011V04 and isbn 1-58563-585-5,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
thirteen hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and used in the VACE (Video Analysis and Content Extraction)
2005 evaluation. The VACE program was established to develop novel algorithms for
automatic video content extraction, multi-modal fusion, and event understanding. During
VACE Phases I and II, the program made significant progress in the automated detection
and tracking of moving objects including faces, hands, people, vehicles and text in
four primary video domains: broadcast news, meetings, street surveillance, and unmanned
aerial vehicle motion imagery. Initial results were also obtained on automatic analysis
of human activities and understanding of video sequences. Three performance evaluations
were conducted under the auspices of the VACE program between 2004 and 2007. The 2005
evaluation was administered by USF in collaboration with NIST and guided by an advisory
forum including the evaluation participants. LDC has previously released NIST/USF
Evaluation Resources for the VACE Program -- Meeting Data Training Set Part 1 LDC2011V01,
NIST/USF Evaluation Resources for the VACE Program -- Meeting Data Training Set Part
2 LDC2011V02 and NIST/USF Evaluation Resources for the VACE Program -- Meeting Data
Test Set Part 1 LDC2011V03. *Data* NIST's Meeting Data Collection Laboratory is designed
to collect corpora to support research, development and evaluation in meeting recognition
technologies. It is equipped to look and sound like a conventional meeting space.
The data collection facility includes five Sony EVI-D30 video cameras, four of which
have stationary views of a center conference table (one view from each surrounding
wall) with a fixed focus and viewing angle, and an additional floating camera which
is used to focus on particular participants, whiteboard or conference table depending
on the meeting forum. The data is captured in a NIST-internal file format. The video
data was extracted from the NIST format and encoded using the MPEG-2 standard in NTSC
format. Further information concerning the video data parameters can be found in the
documentation included with this corpus. Note: due to a last moment update, the file
lists on the published media are inaccurate. For up to date lists, please see the
online documentation for this corpus.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635863
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 276-308-572-849-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2005 NIST Speaker Recognition Evaluation Test Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2005 Speaker Recognition Evaluation Test Data, Linguistic Data Consortium (LDC)
catalog number LDC2011S04 and isbn 1-58563-586-3, was developed at LDC and NIST (National
Institute of Standards and Technology). It consists of 525 hours of conversational
telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated
English transcripts used as test data in the NIST-sponsored 2005 Speaker Recognition
Evaluation (SRE). The ongoing series of SRE yearly evaluations conducted by NIST are
intended to be of interest to researchers working on the general problem of text-independent
speaker recognition. To that end the evaluations are designed to be simple, to focus
on core technology issues, to be fully supported and accessible. The task of the 2005
SRE evaluation was speaker detection, that is, to determine whether a specified speaker
is speaking during a given segment of conversational speech. The task was divided
into 20 distinct and separate tests involving one of five training conditions and
one of four test conditions. Further information about the task conditions is contained
in The NIST Year 2005 Speaker Recognition Evaluation Plan. The training data for
the 2005 evaluation is available in NIST 2005 Speaker Recognition Evaluation Training
Data LDC2011S01. *Data* The speech data consists of conversational telephone speech
with multi-channel data collected by LDC simultaneously from a number of auxiliary
microphones. The files are organized into two segments: 10-second two-channel excerpts
(continuous segments from single conversations that are estimated to contain approximately
10 seconds of actual speech in the channel of interest) and 5-minute two-channel conversations.
The data are stored as 8-bit u-law speech signals in NIST SPHERE format. In addition
to the standard header fields, the SPHERE header for each file contains some auxiliary
information that includes the language of the conversation and whether the data was
recorded over a telephone line. English language word transcripts in .cmt format were
produced using an automatic speech recognition system (ASR) with error rates in the
range of 15-30%.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish, Russian, English, Mandarin Chinese, and Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Conversation
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Russian language
- Form subdivision:
Databases.
- General subdivision:
Spoken Russian
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Spoken Spanish
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635901
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 758-179-408-820-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank: Part 2 v 3.1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank: Part 2 (ATB2) v 3.1 was developed at the Linguistic Data Consortium
(LDC). It consists of 501 newswire stories from Ummah Press with part-of-speech (POS),
morphology, gloss and syntactic treebank annotation in accordance with the Penn Arabic
Treebank (PATB) Guidelines developed in 2008 and 2009. This release represents a significant
revision of LDC's previous ATB2 publication: Arabic Treebank: Part 2 v 2.0 LDC2004T02.
The ongoing PATB project supports research in Arabic-language natural language processing
and human language technology development. The methodology and work leading to the
release of this publication are described in detail in the documentation accompanying
this corpus and in two research papers: Enhancing the Arabic Treebank: A Collaborative
Effort toward New Annotation Guidelines and Consistent and Flexible Integration of
Morphological Annotation in the Arabic Treebank. *Data* ATB2 v 3.1 contains a total
of 144,199 source tokens before clitics are split, and 169,319 tree tokens after clitics
are separated for the treebank annotation. Source texts were selected from Ummah Press
news archives covering the period from July 2001 through September 2002. *Sponsorship*
This work was supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not
necessarily reflect the position or the policy of the Government, and no official
endorsement should be inferred.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gaddeche, Fatma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mekki, Wigdan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bouziri, Basma
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zaghouani, Wajdi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 956-489-013-269-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tir
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
geo
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
wuu
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tir
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
nan
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
khm
- Language code of text/sound track or separate title:
kat
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
ary
- Language code of text/sound track or separate title:
kxm
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 NIST Speaker Recognition Evaluation Training Set Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008 NIST Speaker Recognition Evaluation Training Set Part 2, Linguistic Data Consortium
(LDC) catalog number LDC2011S07 and ISBN 1-58563-591-X, was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 950 hours of multilingual
telephone speech and English interview speech along with transcripts and other materials
used as training data in the 2008 NIST Speaker Recognition Evaluation (SRE). SRE is
part of an ongoing series of evaluations conducted by NIST. These evaluations are
an important contribution to the direction of research efforts and the calibration
of technical capabilities. They are intended to be of interest to all researchers
working on the general problem of text-independent speaker recognition. To this end
the evaluation is designed to be simple, to focus on core technology issues, to be
fully supported, and to be accessible to those wishing to participate. The 2008 evaluation
was distinguished from prior evaluations, in particular those in 2005 and 2006, by
including not only conversational telephone speech data but also conversational speech
data of comparable duration recorded over a microphone channel involving an interview
scenario. Additional documentation is available at the NIST web site for the 2008
SRE and within the 2008 SRE Evaluation Plan. *Data* The speech data in this release
was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in
Philadelphia and by the International Computer Science Institute (ICSI) at the University
of California, Berkeley. This collection was part of the Mixer 5 project, which was
designed to support the development of robust speaker recognition technology by providing
carefully collected and audited speech from a large pool of speakers recorded simultaneously
across numerous microphones and in different communicative situations and/or in multiple
languages. Mixer participants were native English speakers and bilingual English speakers.
The telephone speech in this corpus is predominantly English, but also includes the
languages listed in the language note below. All interview segments are in English. Telephone speech represents
approximately 523 hours of the data, and microphone speech represents the other 427
hours. The telephone speech segments include summed-channel excerpts in the range
of 5 minutes from longer original conversations. The interview material includes single
channel conversation interview segments of at least 8 minutes from a longer interview
session. As in prior evaluations, intervals of silence were not removed. English language
transcripts in .cfm format were produced using an automatic speech recognition (ASR)
system. There are approximately six files distributed as part of SRE08 where each
file is a 1024-byte header with no audio. However, these files were not included in
the trials or keys distributed in the SRE08 aggregate corpus.
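The header-only files mentioned in the summary can be screened out before any audio processing. The following is a minimal sketch, not part of the corpus tooling: it assumes the speech files sit in a local directory tree and carry a .sph extension (both are assumptions, since the record does not state the file layout), and the function name find_header_only_files is hypothetical.

```python
from pathlib import Path

HEADER_BYTES = 1024  # header size stated in the summary


def find_header_only_files(root, pattern="*.sph"):
    """Return files under `root` whose size equals the header size.

    A file of exactly HEADER_BYTES bytes has a header and no audio payload.
    The "*.sph" pattern is an assumption about how the files are named.
    """
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and path.stat().st_size == HEADER_BYTES
    )
```

Calling find_header_only_files on the corpus directory would return the paths to skip; a size check alone is used here because the summary describes these files purely in terms of their length.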
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Wu Chinese, Vietnamese, Uzbek, Urdu, Tigrinya, Thai, Tagalog,
Spanish, Russian, Panjabi, Min Nan Chinese, Lao, Korean, Central Khmer, Georgian,
Japanese, Italian, Hindi, Persian, English, Mandarin Chinese, Bengali, Egyptian Arabic,
Moroccan Arabic, Northern Khmer, Dari, Iranian Persian, Chinese, and Arabic. Documentation
in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635987
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 325-563-274-981-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part
2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part
2, Linguistic Data Consortium (LDC) catalog number LDC2011V06 and ISBN 1-58563-598-7,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
twenty hours of meeting room video data collected in 2005 and 2006 and annotated for
the VACE (Video Analysis and Content Extraction) 2006 face and person tracking tasks.
The VACE program was established to develop novel algorithms for automatic video content
extraction, multi-modal fusion, and event understanding. During VACE Phases I and
II, the program made significant progress in the automated detection and tracking
of moving objects including faces, hands, people, vehicles and text in four primary
video domains: broadcast news, meetings, street surveillance, and unmanned aerial
vehicle motion imagery. Initial results were also obtained on automatic analysis of
human activities and understanding of video sequences. Three performance evaluations
were conducted under the auspices of the VACE program between 2004 and 2007. In 2006,
the VACE program and the European Union's Computers in the Human Interaction Loop (CHIL)
collaborated to hold the Classification of Events, Activities and Relationships (CLEAR)
Evaluation. This was an international effort to evaluate systems designed to analyse
people, their identities, activities, interactions and relationships in human-human
interaction scenarios, as well as related scenarios. The VACE program contributed
the evaluation infrastructure (e.g., data, scoring, tools) for a specific set of
tasks, and the CHIL consortium, coordinated by the Karlsruhe Institute of Technology,
contributed a separate set of evaluation infrastructure. To the extent possible, the
VACE and CHIL programs harmonized their evaluation protocols and metrics. LDC has
previously released: * NIST/USF Evaluation Resources for the VACE Program -- Meeting
Data Training Set Part 1 LDC2011V01 * NIST/USF Evaluation Resources for the VACE Program
-- Meeting Data Training Set Part 2 LDC2011V02 * NIST/USF Evaluation Resources for
the VACE Program -- Meeting Data Test Set Part 1 LDC2011V03 * NIST/USF Evaluation
Resources for the VACE Program -- Meeting Data Test Set Part 2 LDC2011V04 * 2006 NIST/USF
Evaluation Resources for the VACE Program -- Meeting Data Test Set Part 1 LDC2011V05
*Data* The meeting room data used for the 2006 test set was collected by the following
sites in 2005 and 2006: Carnegie Mellon University (USA), University of Edinburgh
(Scotland), IDIAP Research Institute (Switzerland), NIST (USA), Netherlands Organization
for Applied Scientific Research (Netherlands) and Virginia Polytechnic Institute and
State University (USA). Each site had its own independent camera setup, illuminations,
viewpoints, people and topics. Most of the datasets included High-Definition (HD)
recordings, but those were subsequently formatted to MPEG-2 for the evaluation. *Tools*
The VACE evaluation tools have been integrated into NIST's downloadable Framework for
Detection Evaluation (F4DE) Toolkit. The toolkit contains small example files for
each of the task/object/domain scoring combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635936
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 447-232-270-158-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
French Gigaword Third Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
French Gigaword Third Edition is a comprehensive archive of newswire text data that
has been acquired over several years by the Linguistic Data Consortium (LDC) at the
University of Pennsylvania. This third edition updates French Gigaword Second Edition
(LDC2009T28) and adds material collected from January 1, 2009 through December 31,
2010. The two distinct international sources of French newswire in this edition, and
the time spans of collection covered for each, are as follows: * Agence France-Presse (afp_fre)
May 1994 - Dec. 2010 * Associated Press French Service (apw_fre) Nov. 1994 - Dec.
2010. The seven-letter codes in parentheses include the three-character source name
abbreviations and the three-character language code (fre) separated by an underscore
(_) character. The three-letter language code conforms to the ISO 639-2/B standard.
*Data* Each data file name consists of the 7-letter prefix plus another underscore
character, followed by a 6-digit date (representing the year and month during which
the file contents were generated by the respective news source), followed by a .gz
file extension, indicating that the file contents have been compressed using the GNU
gzip compression utility (RFC 1952). So, each file contains all the usable data received
by LDC for the given month from the given news source. All text data are presented
in SGML form, using a very simple, minimal markup structure; all text consists of printable
ASCII, white space, and printable code points in the Latin-1 Supplement character table,
as defined by the Unicode Standard (ISO 10646) for the accented characters used in
French. The Supplement/accented characters are presented in UTF-8 encoding. The file
dtd/gigaword_f.dtd in the dtd directory provides the formal Document Type Declaration
for parsing the SGML content. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file. The SGML structure for this release
represents some notable differences relative to the markup strategy used in early
(pre-Gigaword) LDC publications of newswire data; these are intended to facilitate
bulk processing of the present corpus. The major differences are: * Early corpora
usually organized the data as one file per day, or limited the average file size to
one megabyte (MB). Typical compressed file sizes in the current corpus range from
about 0.1 MB to about 10 MB; this equates to a range of about 0.5 to 30 MB per file
when the data are uncompressed. In general, these files are not intended for use with
interactive text editors or word processing software (though many such programs are
likely to work reasonably well with these files). Rather, it is expected that the files
will be used as input to programs that are geared to dealing with data in such quantities,
for filtering, conditioning, indexing, statistical summary, etc. * Early corpora tended
to use different markup outlines (different tag sets) depending on the data source;
the data source's structural properties were generally preserved to the extent possible
(even though many elements of the delivered structure may have been meaningless for
research use). The present corpus uses only the information structure that is common
to all sources and serves a clear function: headline, dateline, and core news content
(usually containing paragraphs). The dateline is a brief string typically found at
the beginning of the first paragraph in each news story, giving the location the report
is coming from, and sometimes the news service and/or date; since this content is not
part of the initial sentence, we separate it from the first paragraph (this was not
done prior to the Gigaword corpora). For all of the documents in this corpus, we have
applied a rudimentary (and approximate) categorization of DOC units into four distinct
types. The classification is indicated by the type="string" attribute that is included
in each opening DOC tag. The four types are: * story : This is by far the most frequent
type, and it represents the most typical newswire item: a coherent report on a particular
topic or event, consisting of paragraphs and full sentences. * multi : This type of
DOC contains a series of unrelated blurbs, each of which briefly describes a particular
topic or event; this is typically applied to DOCs that contain summaries of today's
news, news briefs in ... (some general area like finance or sports), and so on. *
advis : (short for advisory) These are DOCs which the news service addresses to news
editors -- they are not intended for publication to the end users (the populations
who read the news). * other : This represents DOCs that clearly do not fall into any
of the above types -- in general, items of this type are intended for broad circulation
(they are not advisories), they may be topically coherent (unlike multi type DOCS),
and they typically do not contain paragraphs or sentences (they are not really stories);
these are things like lists of sports scores, stock prices, temperatures around the
world, and so on. The overall totals for each source are summarized below. Note that
the Totl-MB numbers show the amount of data when the files are uncompressed (i.e.
approximately 5.7 gigabytes, total); the Gzip-MB column shows totals for compressed
file sizes; the K-wrds numbers are simply the number of white space-separated tokens
(of all types) after all SGML tags are eliminated.
Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
afp_fre     195     1503     4255  641381  2356888
apw_fre     194      489     1446  221470   801075
TOTAL       389     1992     5701  862851  3157963
*Sample* Please view this sample.
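The file naming scheme and the DOC type attribute described in the summary lend themselves to simple pattern matching. The sketch below is illustrative only and is not LDC tooling: the function names parse_file_name and count_doc_types are hypothetical, and the regular expression for the opening DOC tag assumes the type value is written in double quotes, which the record does not confirm.

```python
import re
from collections import Counter

# <source>_<lang>_<YYYYMM>.gz: 7-letter prefix, underscore, 6-digit date, .gz
FILE_NAME = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")

# Opening DOC tag carrying a type attribute (quoting style is an assumption)
DOC_TYPE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def parse_file_name(name):
    """Split a data file name into source, language code, year and month."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, lang, year, month = match.groups()
    return {"source": source, "lang": lang, "year": int(year), "month": int(month)}


def count_doc_types(sgml_text):
    """Tally DOC units by type (story, multi, advis, other)."""
    return Counter(DOC_TYPE.findall(sgml_text))
```

For example, parse_file_name("afp_fre_201012.gz") yields the source afp, the language code fre, the year 2010 and the month 12. Since the files are gzip-compressed and the accented characters are UTF-8 encoded, the text passed to count_doc_types could be read with Python's gzip.open(path, "rt", encoding="utf-8").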
LANGUAGE NOTE
- Language note:
Content in French. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
French language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mendonça, Ângelo
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635928
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011V05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 362-227-362-046-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part
1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011V05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part
1, Linguistic Data Consortium (LDC) catalog number LDC2011V05 and ISBN 1-58563-576-6,
was developed by researchers at the Department of Computer Science and Engineering,
University of South Florida (USF), Tampa, Florida and the Multimodal Information Group
at the National Institute of Standards and Technology (NIST). It contains approximately
fifteen hours of meeting room video data collected in 2005 and 2006 and annotated for
the VACE (Video Analysis and Content Extraction) 2006 face and person tracking tasks.
The VACE program was established to develop novel algorithms for automatic video content
extraction, multi-modal fusion, and event understanding. During VACE Phases I and
II, the program made significant progress in the automated detection and tracking
of moving objects including faces, hands, people, vehicles and text in four primary
video domains: broadcast news, meetings, street surveillance, and unmanned aerial
vehicle motion imagery. Initial results were also obtained on automatic analysis of
human activities and understanding of video sequences. Three performance evaluations
were conducted under the auspices of the VACE program between 2004 and 2007. In 2006,
the VACE program and the European Union's Computers in the Human Interaction Loop (CHIL)
collaborated to hold the CLassification of Events, Activities and Relationships (CLEAR)
Evaluation. This was an international effort to evaluate systems designed to analyze
people, their identities, activities, interactions and relationships in human-human
interaction scenarios, as well as related scenarios. The VACE program contributed
the evaluation infrastructure (e.g., data, scoring, tools) for a specific set of
tasks, and the CHIL consortium, coordinated by the Karlsruhe Institute of Technology,
contributed a separate set of evaluation infrastructure. To the extent possible, the
VACE and CHIL programs harmonized their evaluation protocols and metrics. LDC has
previously released NIST/USF Evaluation Resources for the VACE Program -- Meeting
Data Training Set Part 1 LDC2011V01, NIST/USF Evaluation Resources for the VACE Program
-- Meeting Data Training Set Part 2 LDC2011V02, NIST/USF Evaluation Resources for the
VACE Program -- Meeting Data Test Set Part 1 LDC2011V03, and NIST/USF Evaluation Resources
for the VACE Program -- Meeting Data Test Set Part 2 LDC2011V04. *Data* The meeting
room data used for the 2006 test set was collected by the following sites in 2005
and 2006: Carnegie Mellon University (USA), University of Edinburgh (Scotland), IDIAP
Research Institute (Switzerland), NIST (USA), Netherlands Organization for Applied
Scientific Research (Netherlands) and Virginia Polytechnic Institute and State University
(USA). Each site had its own independent camera setup, illuminations, viewpoints,
people and topics. Most of the datasets included High-Definition (HD) recordings,
but those were subsequently formatted to MPEG-2 for the evaluation. *Tools* The VACE
evaluation tools have been integrated into NIST's downloadable Framework for Detection
Evaluation (F4DE) Toolkit. The toolkit contains small example files for each of the
task/object/domain scoring combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011V05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636045
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 347-741-147-064-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ModeS TimeBank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ModeS TimeBank 1.0 was developed by researchers at Technical University of Madrid
and Barcelona Media and is a corpus of Modern Spanish (17th and 18th centuries) annotated
with temporal and event information according to TimeML mark-up and annotated with
spatial information following the SpatialML scheme. TimeML (Pustejovsky et al., 2005)
is a specification language for annotating eventualities and time expressions in natural
language as well as the temporal relations among them, thus facilitating the task
of extraction, representation and exchange of temporal information. SpatialML (Mani
et al., 2008) is a specification language for annotating and normalizing spatial expressions
by means of geographic coordinates. LDC has released the following corpora incorporating
TimeML or SpatialML annotation: TimeBank 1.2 LDC2006T08, FactBank 1.0 LDC2009T23,
ACE 2005 English SpatialML Annotations Version 2 LDC2011T02 and ACE 2005 Mandarin
SpatialML Annotations LDC2010T09. *Data* ModeS TimeBank 1.0 contains 102 documents
reporting a sea-crossing cruise by a ship called La Princesa, which took place from
December 1768 to April 1769. There exist copious logbooks from that period that not
only provide information about shipping routes, but also contain valuable data concerning
information flows, commercial agents and social networks. The original corpus manuscript
is preserved in the Archivo General de Indias (General Archive of the Indies) and
is available online at the Portal de Archivos Españoles. This corpus was created within
the framework of the DynCoopNet project (Dynamic Compatibility of Cooperation-Based
Self-Organizing Networks in the First Global Age) which is focused on the study of
trade network cooperation during the 15th-19th centuries and incorporates into its
work maps, charts, databases and natural language documents. All text is encoded in
UTF-8. The data in ModeS TimeBank 1.0 has been tokenized, POS-tagged, and annotated
with space, time and event information according to the TimeML and SpatialML specification
schemes. More specifically, the entities annotated in the corpus are the following:
* Events: (tag EVENT, from TimeML). These include finite and non-finite verbal constructions,
nominalizations, nouns, adjectives and prepositional phrases. * Temporal expressions
(tag TIMEX3, from TimeML). These include expressions of dates, times, durations and
frequencies, both precise and vague. * Spatial expressions (tag PLACE, from SpatialML).
These are used for proper and common nouns, adjectives, adverbs or spatial coordinates.
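As an illustration of the three annotation layers listed above, the following minimal Python sketch counts EVENT, TIMEX3 and PLACE elements in an inline-annotated snippet. The sample sentence, the wrapper element `doc` and all attribute values are invented for illustration and are not taken from the corpus; the actual file layout of ModeS TimeBank 1.0 may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inline-annotated sentence (not from the corpus).
SAMPLE = (
    '<doc>El <TIMEX3 tid="t1" type="DATE">día 3 de diciembre</TIMEX3> '
    '<EVENT eid="e1">salimos</EVENT> del <PLACE id="p1">puerto</PLACE>.</doc>'
)

def count_tags(xml_text, tags=("EVENT", "TIMEX3", "PLACE")):
    """Return how many elements of each requested tag occur in xml_text."""
    root = ET.fromstring(xml_text)
    counts = Counter(el.tag for el in root.iter() if el.tag in tags)
    # Report every requested tag, including those with zero occurrences.
    return {tag: counts.get(tag, 0) for tag in tags}

result = count_tags(SAMPLE)
```

The same function could be applied per document to compare how densely events, time expressions and places are annotated, assuming the files parse as well-formed XML.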
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nieto, Marta Guerrero
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635944
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 289-720-923-302-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
wuu
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
uzb
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
tgl
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
pan
- Language code of text/sound track or separate title:
nan
- Language code of text/sound track or separate title:
lao
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
ita
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
ary
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 NIST Speaker Recognition Evaluation Test Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic
Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It
contains 942 hours of multilingual telephone speech and English interview speech along
with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition
Evaluation (SRE). NIST SRE is part of an ongoing series of evaluations conducted by
NIST. These evaluations are an important contribution to the direction of research
efforts and the calibration of technical capabilities. They are intended to be of
interest to all researchers working on the general problem of text-independent speaker
recognition. To this end the evaluation is designed to be simple, to focus on core
technology issues, to be fully supported, and to be accessible to those wishing to
participate. The 2008 evaluation was distinguished from prior evaluations, in particular
those in 2005 and 2006, by including not only conversational telephone speech data
but also conversational speech data of comparable duration recorded over a microphone
channel involving an interview scenario. LDC previously released the 2008 NIST SRE
Training Set in two parts as LDC2011S05 and LDC2011S07. Additional documentation is
available at the NIST web site for the 2008 SRE and within the 2008 SRE Evaluation
Plan. *Data* The speech data in this release was collected in 2007 by LDC at its Human
Subjects Collection facility in Philadelphia and by the International Computer Science
Institute (ICSI) at the University of California, Berkeley. This collection was part
of the Mixer 5 project, which was designed to support the development of robust speaker
recognition technology by providing carefully collected and audited speech from a
large pool of speakers recorded simultaneously across numerous microphones and in
different communicative situations and/or in multiple languages. Mixer participants
were native English and bilingual English speakers. The telephone speech in this corpus
is predominantly English, but also includes the other languages listed in this record. All interview segments
are in English. Telephone speech represents approximately 368 hours of the data, whereas
microphone speech represents the other 574 hours. The telephone speech segments include
two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel
excerpts in the range of 5 minutes. The microphone excerpts are either 3 or 8 minutes
in length. As in prior evaluations, intervals of silence were not removed. There are
approximately six files distributed as part of SRE08 where each file is a 1024-byte
header with no audio. However, these files were not included in the trials or keys
distributed in the SRE08 aggregate corpus. English language transcripts in .ctm format
were produced using an automatic speech recognition (ASR) system.
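The figures in this description can be checked with a short Python sketch: the telephone and microphone hours should add up to the stated total, and a header-only file can be recognized by its size. The function name and the file names are hypothetical, and the size test assumes that the header-only files are exactly 1024 bytes long, as the description implies.

```python
import os
import tempfile

TELEPHONE_HOURS = 368
MICROPHONE_HOURS = 574
TOTAL_HOURS = TELEPHONE_HOURS + MICROPHONE_HOURS  # should match the stated 942

HEADER_BYTES = 1024  # header size given in the description

def is_header_only(path, header_bytes=HEADER_BYTES):
    """True if the file is no larger than a bare header (no audio follows)."""
    return os.path.getsize(path) == header_bytes

def demo():
    """Build two dummy files and classify them; file names are made up."""
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "empty.sph")
        full = os.path.join(tmp, "full.sph")
        with open(empty, "wb") as f:
            f.write(b" " * HEADER_BYTES)
        with open(full, "wb") as f:
            f.write(b" " * HEADER_BYTES + b"\xff" * 160)
        return is_header_only(empty), is_header_only(full)
```

A filter of this kind might be run once over the audio directory before scoring, so that header-only files are skipped rather than treated as very short segments.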
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Wu Chinese, Vietnamese, Uzbek, Urdu, Thai, Tagalog, Tamil,
Russian, Panjabi, Min Nan Chinese, Lao, Korean, Japanese, Italian, Hindi, Persian,
Mandarin Chinese, Bengali, Egyptian Arabic, Moroccan Arabic, Dari, Iranian Persian,
English, Chinese, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635952
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 494-144-988-211-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Gigaword Fifth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11
and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire
text data that has been acquired from Arabic news sources by LDC at the University
of Pennsylvania. Arabic Gigaword Fifth Edition includes all of the content of the
fourth edition of Arabic Gigaword (LDC2009T30) plus new data covering the period from
January 2009 through December 2010. Nine distinct sources of Arabic newswire are represented
here: * Asharq Al-Awsat (aaw_arb) * Agence France Presse (afp_arb) * Al-Ahram (ahr_arb)
* Assabah (asb_arb) * Al Hayat (hyt_arb) * An Nahar (nhr_arb) * Al-Quds Al-Arabi (qds_arb)
* Ummah Press (umh_arb) * Xinhua News Agency (xin_arb) The seven-character codes shown
above represent both the directory names where the data files are found, and the 7-letter
prefix that appears at the beginning of every file name. The 7-letter codes consist
of the three-character source name IDs and the three-character language code (arb)
separated by an underscore (_) character. The three-character language code conforms
to the ISO 639-3 standard. In addition to adding new data, the following updates were
made: * Repeated documents in Asharq Al-Awsat data from 2008 were removed. * Document
formatting and docid duplication problems were corrected in Agence France Presse (AFP)
data. * Significant duplication of content in 2007-2008 An Nahar data was detected,
and the duplicated documents were removed. More details about these changes can be
found in the included readme file. *Data* All text data are presented in SGML form,
using a very simple, minimal markup structure. For every opening tag (DOC, HEADLINE,
DATELINE, TEXT, P), there is a corresponding closing tag -- always. The attribute
values in the DOC tag are always presented within double-quotes; the id= attribute
of DOC consists of the 7-letter source abbreviation (in CAPS), an underscore character,
an 8-digit date string representing the date of the story (YYYYMMDD), a period, and
a 4-digit sequence number starting at 0001 for each date (e.g. XIN_ARB_200101.0001);
in this way, every DOC in the corpus is uniquely identifiable by the id string. For
this release, all sources have received a uniform treatment in terms of quality control,
and we have applied a rudimentary (and _approximate_) categorization of DOC units
into three distinct types. The classification is indicated by the type=string attribute
that is included in each opening DOC tag. The three types are: * story: This is by
far the most frequent type, and it represents the most typical newswire item: a coherent
report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated blurbs, each of which briefly
describes a particular topic or event; this is typically applied to DOCs that contain
summaries of today's news, news briefs in ... (some general area like finance or sports),
and so on. * other: This represents DOCs that clearly do not fall into any of the
above types -- in general, items of this type are intended for broad circulation (they
are not advisories), they may be topically coherent (unlike multi type DOCs), and
they typically do not contain paragraphs or sentences (they aren't really stories);
these are things like lists of sports scores, stock prices, temperatures around the
world, and so on. Other Gigaword corpora (e.g., in English and Chinese) have a fourth
category, advis (for advisory), which applies to DOCs that contain text intended solely
for news service editors, not the news-reading public. The task of determining patterns
for assigning non-story type labels was carried out by a native speaker of Arabic,
and the advis category was determined to be inapplicable to the data. Note that the
markup was applied algorithmically, using logic that was based on less-than-complete
knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have
their intended content, but due to the inherent variability (and the inevitable source
errors) in the data, users may find occasional mishaps where the headline and/or dateline
were not successfully identified (hence show up within TEXT), or where an initial
sentence or paragraph has been mistakenly tagged as the headline or dateline. *Sample*
Please view this sample. *Sponsorship* This work was supported in part by the Defense
Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or policy of the Government,
and no official endorsement should be inferred.
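The DOC id layout described above (source abbreviation in capitals, underscore, 8-digit date, period, 4-digit sequence number) can be sketched as a small Python parser. The function name, the returned field names and the example id are illustrative assumptions and are not part of the corpus documentation.

```python
import re
from datetime import datetime

# SRC_LNG, underscore, YYYYMMDD, period, 4-digit sequence number.
DOC_ID = re.compile(r"^([A-Z]{3}_[A-Z]{3})_(\d{8})\.(\d{4})$")

def parse_doc_id(doc_id):
    """Split a DOC id into its parts, or return None if it does not fit."""
    match = DOC_ID.match(doc_id)
    if match is None:
        return None
    source, date_str, seq = match.groups()
    try:
        date = datetime.strptime(date_str, "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that do not form a calendar date
    # The lowercased source code doubles as directory name and file prefix.
    return {"directory": source.lower(), "date": date, "sequence": int(seq)}

# Hypothetical id, constructed to follow the described pattern.
parsed = parse_doc_id("XIN_ARB_20100115.0001")
```

Returning None for strings that do not fit, instead of raising, is a design choice that makes it easy to collect and inspect irregular ids in bulk.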
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635960
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 595-627-966-073-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Spanish Gigaword Third Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Spanish Gigaword Third Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T12
and ISBN 1-58563-596-0, was produced by LDC. It is a comprehensive archive of Spanish
newswire text data that has been acquired over several years by LDC. Spanish Gigaword
Third Edition includes all of the content of the second edition (LDC2009T21) and adds
data collected from January 1, 2009 through December 31, 2010. The three distinct
international sources of Spanish newswire in this edition, and the time spans of collection
covered for each, are as follows: * Agence France-Presse, Spanish (afp_spa) May 1994
- Dec 2010 * Associated Press, Spanish (apw_spa) Nov 1993 - Dec 2010 * Xinhua News
Agency, Spanish (xin_spa) Sep 2001 - Dec 2010 The seven-letter codes in the parentheses
above include the three-character source name abbreviations and the three-character
language code (spa) separated by an underscore (_) character. The three-letter language
code conforms to LDC's internal convention based on the ISO 639-3 standard. *Data*
All text data are presented in SGML/XML form, using a very simple, minimal markup
structure; all text consists of printable ASCII, whitespace, and printable code points
in the Latin1 Supplement character table, as defined by both ISO-8859-1 and the Unicode
Standard (ISO 10646) for the accented characters used in Spanish. The Supplement/accented
characters are rendered using UTF-8 encoding. For all of the documents in this corpus,
we have applied a rudimentary (and _approximate_) categorization of DOC units into
four distinct types. The classification is indicated by the type=string attribute
that is included in each opening DOC tag. The four types are: * story : This is by
far the most frequent type, and it represents the most typical newswire item: a coherent
report on a particular topic or event, consisting of paragraphs and full sentences.
* multi : This type of DOC contains a series of unrelated blurbs, each of which briefly
describes a particular topic or event; this is typically applied to DOCs that contain
summaries of today's news, news briefs in ... (some general area like finance or sports),
and so on. * advis : (short for advisory) These are DOCs which the news service addresses
to news editors -- they are not intended for publication to the end users (the populations
who read the news). This type contains formulaic, repetitive content (contact phone
numbers, etc). * other : This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for broad circulation
(they are not advisories), they may be topically coherent (unlike multi type DOCS),
and they typically do not contain paragraphs or sentences (they aren't really stories);
these are things like lists of sports scores, stock prices, temperatures around the
world, and so on. *Sample* Please view this sample.
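The character repertoire described above (printable ASCII, whitespace and printable Latin-1 Supplement code points, stored as UTF-8) can be verified with a minimal Python sketch. The function names are hypothetical, and two details are assumptions rather than statements from the documentation: whitespace is taken to mean space, tab, carriage return and line feed, and the printable Latin-1 Supplement range is taken to be U+00A0 through U+00FF.

```python
def is_allowed(ch):
    """True for whitespace, printable ASCII or printable Latin-1 Supplement."""
    cp = ord(ch)
    return ch in " \t\r\n" or 0x21 <= cp <= 0x7E or 0xA0 <= cp <= 0xFF

def disallowed_chars(data):
    """Decode UTF-8 bytes and list the distinct characters outside the set."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return sorted({ch for ch in text if not is_allowed(ch)})

# Invented sample strings, not taken from the corpus.
ok = disallowed_chars("El niño llegó a Málaga.\n".encode("utf-8"))
bad = disallowed_chars("precio: 5 €".encode("utf-8"))
```

Here `ok` is empty, while `bad` reports the euro sign, which lies outside the Latin-1 Supplement block.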
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mendonça, Ângelo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jaquette, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635979
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 182-553-395-072-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST Speaker Recognition Evaluation Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST Speaker Recognition Evaluation Training Set was developed by LDC and NIST
(National Institute of Standards and Technology). It contains 595 hours of conversational
telephone speech in English, Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai
and Urdu and associated English transcripts used as training data in the NIST-sponsored
2006 Speaker Recognition Evaluation (SRE). The ongoing series of SRE yearly evaluations
conducted by NIST are intended to be of interest to researchers working on the general
problem of text-independent speaker recognition. To this end the evaluations are designed
to be simple, to focus on core technology issues, to be fully supported and to be
accessible to those wishing to participate. The task of the 2006 SRE evaluation was
speaker detection, that is, to determine whether a specified speaker is speaking during
a given segment of conversational telephone speech. The task was divided into 15 distinct
and separate tests involving one of five training conditions and one of four test
conditions. Further information about the test conditions and additional documentation
is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation
Plan. *Data* The speech data in this release was collected by LDC as part of the Mixer
project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development
of robust speaker recognition technology by providing carefully collected and audited
speech from a large pool of speakers recorded simultaneously across numerous microphones
and in different communicative situations and/or in multiple languages. The data is
mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Hindi,
Korean, Russian, Thai and Urdu. The telephone speech segments are multi-channel data
collected simultaneously from a number of auxiliary microphones. The files are organized
into three types: two-channel excerpts of approximately 10 seconds, two-channel conversations
of approximately 5 minutes and summed-channel conversations also of approximately
5 minutes. The speech files are stored as 8-bit u-law speech signals in separate SPHERE
files. In addition to the standard header fields, the SPHERE header for each file
contains some auxiliary information that includes the language of the conversation
and whether the data was recorded over a telephone line. English language transcripts
in .ctm format were produced using an automatic speech recognition (ASR) system.
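The header fields mentioned above can be read with a few lines of Python. This is a minimal sketch, assuming the conventional NIST SPHERE layout: a `NIST_1A` line, a header-size line, then `name -type value` lines ending with `end_head`. The function name `parse_sphere_header` and the synthetic sample header are hypothetical and are not taken from this release; the names of the auxiliary fields in the actual files are not assumed here.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a SPHERE file into a dict."""
    first, size_line, _ = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(size_line.strip())  # total header length in bytes
    header = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comment lines
            continue
        name, ftype, value = line.split(" ", 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # -sN: string of length N
            fields[name] = value
    return fields


# Synthetic header for illustration only (values are assumptions).
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")

fields = parse_sphere_header(sample_header)
```

With a real file, the first 1024 bytes (or however many the size line states) would be passed in place of `sample_header`.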
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Urdu, Thai, Russian, Korean, Hindi, English, Mandarin Chinese,
Bengali, Standard Arabic, Chinese, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585635995
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 162-966-215-437-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Gigaword Fifth Edition
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Gigaword Fifth Edition was produced by the Linguistic Data Consortium (LDC).
It is a comprehensive archive of newswire text data that has been acquired from Chinese
news sources by LDC at the University of Pennsylvania. Chinese Gigaword Fifth Edition
includes all of the content of the fourth edition of Chinese Gigaword (LDC2009T27)
plus new data covering the period from January 2009 through December 2010. Eight distinct
sources of Chinese newswire are represented here: * Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn) * Central News Service (cns_cmn) * Guangming
Daily (gmw_cmn) * People's Daily (pda_cmn) * People's Liberation Army Daily (pla_cmn) *
Xinhua News Agency (xin_cmn) * Zaobao Newspaper (zbn_cmn). The seven-letter codes in
the parentheses above are used for the directory names and data files for each source
and are also used (in ALL_CAPS) as part of the unique DOC id string assigned to each
news article. Articles covering the period from January 2009 through December 2010
have been added to the Agence France Presse, Central News Agency (CNA), Central News
Service, Guangming Daily, People's Liberation Army Daily and Xinhua News Agency data
sets. The data from People's Daily covers the period from late June 2009 through December
2010. No new data from Zaobao has been added. Additionally, Zaobao and CNA data included
in previous releases were found to contain non-normalized full-width characters. Those
files have been normalized to correct that issue. *Data* Each data file name consists
of the 7-letter prefix (e.g., xin_cmn) and an underscore character (_) followed by
a 6-digit date (representing the year and month during which the file contents were
originally published by the respective news source), followed by a .gz file extension,
indicating that the file contents have been compressed using the GNU gzip compression
utility (RFC 1952). So, each file contains all the usable data received by LDC for
the given month from the given news source. All text data are presented in SGML form,
using a very simple, minimal markup structure. The file gigaword_c.dtd in the docs
directory provides the formal Document Type Declaration for parsing the SGML content.
The corpus has been fully validated by a standard SGML parser utility (nsgmls), using
this DTD file. For this release, all sources have received a uniform treatment in
terms of quality control, and we have applied a rudimentary (and _approximate_) categorization
of DOC units into four distinct types. The classification is indicated by the type=string
attribute that is included in each opening DOC tag. The four types are: * story: This
is by far the most frequent type, and it represents the most typical newswire item:
a coherent report on a particular topic or event, consisting of paragraphs and full
sentences. * multi: This type of DOC contains a series of unrelated blurbs, each of
which briefly describes a particular topic or event; this is typically applied to DOCs
that contain summaries of today's news, news briefs in ... (some general area like
finance or sports), and so on. * advis: (short for advisory) These are DOCs which
the news service addresses to news editors -- they are not intended for publication
to the end users (the populations who read the news). We also find a lot of formulaic,
repetitive content in DOCs of this type (contact phone numbers, etc). * other: This
represents DOCs that clearly do not fall into any of the above types -- in general,
items of this type are intended for broad circulation (they are not advisories), they
may be topically coherent (unlike multi type DOCs), and they typically do not contain
paragraphs or sentences (they aren't really stories); these are things like lists of
sports scores, stock prices, temperatures around the world, and so on. *Sample* Please
view this sample.
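The file-naming scheme and the DOC type attribute described above can be handled as follows. This is a minimal sketch under stated assumptions: the 6-digit date is taken to be in YYYYMM order, the type attribute value may or may not be quoted, and the helper names (`parse_gigaword_filename`, `count_doc_types`) and the sample DOC ids are hypothetical.

```python
import re
from collections import Counter

# 7-character source prefix (e.g. xin_cmn), underscore, 6-digit date, .gz
FILENAME_RE = re.compile(
    r"^(?P<source>[a-z]{3}_cmn)_(?P<year>\d{4})(?P<month>\d{2})\.gz$"
)

# Opening DOC tag with a type attribute; quotes around the value are optional.
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="?(\w+)"?', re.IGNORECASE)


def parse_gigaword_filename(name: str) -> tuple:
    """Split a data file name into (source, year, month), assuming YYYYMM."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return m["source"], int(m["year"]), int(m["month"])


def count_doc_types(sgml_text: str) -> Counter:
    """Count DOC units by their type attribute (story, multi, advis, other)."""
    return Counter(t.lower() for t in DOC_TYPE_RE.findall(sgml_text))


# Illustrative input only; the ids below are placeholders, not real DOC ids.
sample = (
    '<DOC id="EXAMPLE-1" type="story"></DOC>\n'
    '<DOC id="EXAMPLE-2" type="multi"></DOC>\n'
    '<DOC id="EXAMPLE-3" type="story"></DOC>\n'
)
parsed = parse_gigaword_filename("xin_cmn_200901.gz")
counts = count_doc_types(sample)
```

For a real file, the text would first be decompressed with Python's `gzip` module before being passed to `count_doc_types`.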
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Foreign news
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kong, Junbo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636002
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 293-615-042-213-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
tha
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 NIST Speaker Recognition Evaluation Test Set Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 NIST Speaker Recognition Evaluation Test Set Part 1 was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 437 hours of conversational
telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi,
Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as
test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE). The ongoing
series of SRE yearly evaluations conducted by NIST are intended to be of interest
to researchers working on the general problem of text independent speaker recognition.
To this end the evaluations are designed to be simple, to focus on core technology
issues, to be fully supported and to be accessible to those wishing to participate.
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether
a specified speaker is speaking during a given segment of conversational telephone
speech. The task was divided into 15 distinct and separate tests involving one of
five training conditions and one of four test conditions. Further information about
the test conditions and additional documentation is available at the NIST web site
for the 2006 SRE and within the 2006 SRE Evaluation Plan. LDC also previously released
2006 NIST Speaker Recognition Evaluation Training Set. *Data* The speech data in this
release was collected by LDC as part of the Mixer project, in particular Mixer Phases
1, 2 and 3. The Mixer project supports the development of robust speaker recognition
technology by providing carefully collected and audited speech from a large pool of
speakers recorded simultaneously across numerous microphones and in different communicative
situations and/or in multiple languages. The data is mostly English speech, but includes
some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai
and Urdu. The telephone speech segments are multi-channel data collected simultaneously
from a number of auxiliary microphones. The files are organized into four types: two-channel
excerpts of approximately 10 seconds, two-channel conversations of approximately 5
minutes, summed-channel conversations also of approximately 5 minutes and a two-channel
conversation with the usual telephone speech replaced by auxiliary microphone data
in the putative target speaker channel. The auxiliary microphone conversations are
also of approximately five minutes in length. The speech files are stored as 8-bit
u-law speech signals in separate SPHERE files. In addition to the standard header
fields, the SPHERE header for each file contains some auxiliary information such as
the language of the conversation. English language transcripts in .ctm format were
produced using an automatic speech recognition (ASR) system.
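For readers unfamiliar with 8-bit u-law storage, the sketch below shows the standard G.711 mu-law expansion to 16-bit linear samples. It assumes that the sample bytes have already been read from past the SPHERE header, that they are not otherwise compressed, and that two-channel data is interleaved sample by sample. The function names are hypothetical.

```python
def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(payload: bytes, channels: int = 2) -> list:
    """De-interleave mu-law bytes into one list of linear samples per channel."""
    return [
        [ulaw_byte_to_pcm16(b) for b in payload[c::channels]]
        for c in range(channels)
    ]


# Four interleaved bytes: channel A gets bytes 0 and 2, channel B gets 1 and 3.
left, right = split_channels(bytes([0xFF, 0x00, 0xFF, 0x80]))
```

Keeping the two channels separate matters here because each side of a two-channel conversation holds a different speaker.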
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese, Urdu, Thai, Spanish, Russian, Korean, Hindi, Persian, English,
Mandarin Chinese, Bengali, Standard Arabic, Dari, Iranian Persian, Chinese, and Arabic.
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2011 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636010
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2011S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 332-216-006-330-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2008 NIST Speaker Recognition Evaluation Supplemental Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2011]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2011S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2008 NIST Speaker Recognition Evaluation Supplemental Set, Linguistic Data Consortium
(LDC) catalog number LDC2011S11 and ISBN 1-58563-601-0, was developed by LDC and NIST
(National Institute of Standards and Technology) and contains additional data distributed
after the main 2008 Speaker Recognition Evaluation (SRE). Specifically, the corpus
consists of 770 hours of English microphone speech along with transcripts and other
materials used as supplemental data in the 2008 NIST Speaker Recognition Evaluation
(SRE) and in a follow-up evaluation to SRE08. NIST SRE is part of an ongoing series
of evaluations conducted by NIST. These evaluations are an important contribution
to the direction of research efforts and the calibration of technical capabilities.
They are intended to be of interest to all researchers working on the general problem
of text independent speaker recognition. To this end the evaluation is designed to
be simple, to focus on core technology issues, to be fully supported, and to be accessible
to those wishing to participate. The 2008 evaluation was distinguished from prior
evaluations, in particular those in 2005 and 2006, by including not only conversational
telephone speech data but also conversational speech data of comparable duration recorded
over a microphone channel involving an interview scenario. The follow-up evaluation
focused on speaker detection in the context of conversational interview type speech
and was designed to measure the performance of SRE08 systems in previously unexposed
test segment channel conditions. LDC previously released the main 2008 NIST SRE Evaluation
in three parts as 2008 NIST Speaker Recognition Evaluation Training Set Part 1 LDC2011S05,
2008 NIST Speaker Recognition Evaluation Training Set Part 2 LDC2011S07 and 2008 NIST
Speaker Recognition Evaluation Test Set LDC2011S08. Additional documentation is available
at the NIST web site for the 2008 SRE and within the 2008 SRE Evaluation Plan and
the Plan for Follow-up Evaluation to SRE08. *Data* The speech data in this release
was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in
Philadelphia and by the International Computer Science Institute (ICSI) at the University
of California, Berkeley. This collection was part of the Mixer 5 project, which was
designed to support the development of robust speaker recognition technology by providing
carefully collected and audited speech from a large pool of speakers recorded simultaneously
across numerous microphones and in different communicative situations and/or in multiple
languages. Mixer participants were native English and bilingual English speakers.
The microphone speech in this corpus is in English and consists of interview excerpts of
approximately 3 minutes and 30 minutes. This supplemental data is split into four
different parts which provide: * new training data distributed to 2008 SRE participants
* additional data distributed to participants in the 2008 SRE follow-up evaluation
* interviewer channel files for the 2008 SRE main test (released after the evaluations)
* supplemental training data (released after the evaluations) English language transcripts
in .cfm format were produced using an automatic speech recognition (ASR) system and
are included for some, but not all, speech data.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2011S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636037
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 239-915-065-794-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TORGO Database of Dysarthric Articulation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TORGO Database of Dysarthric Articulation was developed by the University of Toronto's
departments of Computer Science and Speech Language Pathology in collaboration with
the Holland-Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains
approximately 23 hours of English speech data, accompanying transcripts and documentation
from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral
sclerosis (ALS) and from 7 speakers (4 males, 3 females) from a non-dysarthric control
group. CP and ALS are among the conditions associated with dysarthria, which is caused by disruptions in the
neuro-motor interface that distort motor commands to the vocal articulators, resulting
in atypical and relatively unintelligible speech in most cases. The TORGO database
is primarily a resource for developing advanced automatic speech recognition (ASR)
models suited to the needs of people with dysarthria, but it is also applicable to
non-dysarthric speech. The inability of modern ASR to effectively understand dysarthric
speech is a problem since the more general physical disabilities often associated
with the condition can make other forms of computer input, such as computer keyboards
or touch screens, difficult to use. *Data* The data consists of aligned acoustics
and measured 3D articulatory features from the speakers, recorded using the 3D AG500
electro-magnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern,
Germany) with fully-automated calibration. This system allows for 3D recordings of
articulatory movements inside and outside the vocal tract, thus providing a detailed
window on the nature and direction of speech-related activity. The data was collected
between 2008 and 2010 in Toronto, Canada. All subjects read text consisting of non-words,
short words and restricted sentences from a 19-inch LCD screen. The restricted sentences
included 162 sentences from the sentence intelligibility section of Assessment of
intelligibility of dysarthric speech (Yorkston & Beukelman, 1981) and 460 sentences
derived from the TIMIT database. The unrestricted sentences were elicited by asking
participants to spontaneously describe 30 images in interesting situations taken randomly
from Webber Photo Cards - Story Starters (Webber, 2005), designed to prompt students
to tell or write a story. Data is organized by speaker and by the session in which
each speaker recorded data. Each speaker was assigned a code and given their own file
directory. The code for female speakers begins with F, and the code for male speakers
begins with M. If the speaker was a member of the control group, the letter C follows
the gender code. The last two digits of the code indicate the order in which that
subject was recruited. For example, speaker FC02 was the second female speaker without
dysarthria recruited. Note that some speakers were intentionally left out of the data,
and thus, there are gaps in the numbering. Each speaker's directory contains Session
directories which encapsulate data recorded in the respective visit and, occasionally,
a Notes directory which can include Frenchay assessments (test for the measurement,
description and diagnosis of dysarthria), notes about sessions (e.g., sensor errors),
and other relevant notes. Each Session directory can, but does not necessarily, contain
the following content: * alignment.txt: This is a text file containing the sample
offsets between audio files recorded simultaneously by the array microphone and the
head-worn microphone. * amps: These directories contain raw *.amp and *.ini files
produced by the AG500 articulograph. * phn_*: These directories contain phonemic transcriptions
of audio data. Each file is plain text with a *.PHN file extension and a filename
referring to the utterance number. These files were generated using the free Wavesurfer
tool. * pos: These directories contain the head-corrected positions, velocities, and
orientations of sensor coils for each utterance, as generated by the AG500 articulograph.
* prompts: These directories contain orthographic transcriptions. * rawpos: These
directories are equivalent to the pos directories except that their articulographic
content is not head-normalized to a constant upright position. * wav_*: These directories
contain the acoustics. Each file is a RIFF (little-endian) WAVE audio file (Microsoft
PCM, 16 bit, mono 16000 Hz). * wavall: These directories contain a stereo recording
in which one channel contains the recorded acoustics and the other channel contains
the analog peaks associated with the sweep signal, which is used by the AG500 hardware
for synchronization. Additionally, sessions recorded with the AG500 articulograph
are marked with the file EMA, and those recorded with the video-based system are marked
with the file VIDEO. Files with a date form as the filename and a txt extension (e.g.
april232008cal2.txt, jan28cal3.txt) are the measured responses from calibration. The
*.log and *.calset files contain descriptions of the calibration process, but not
the final result of calibration. See the readme file and the AG500 Wiki for more complete
descriptions of the possible subfolders and of the AG500 specific files. Also, see
session_contents.tsv for a tab-separated table of each session's subfolders and metadata
files.
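The speaker-code convention described above (gender letter, optional C for the control group, two-digit recruitment order) can be decoded mechanically. This is a minimal sketch; the function name `parse_speaker_code` is hypothetical, and any code other than FC02 used with it is an illustrative input rather than a confirmed speaker in the data.

```python
import re

# F or M, optional C for control group, then a two-digit recruitment order.
SPEAKER_RE = re.compile(r"^(?P<gender>[FM])(?P<control>C?)(?P<order>\d{2})$")


def parse_speaker_code(code: str) -> dict:
    """Decode a TORGO-style speaker code into its three components."""
    m = SPEAKER_RE.match(code)
    if m is None:
        raise ValueError(f"unrecognized speaker code: {code!r}")
    return {
        "gender": "female" if m["gender"] == "F" else "male",
        "control": m["control"] == "C",   # True for non-dysarthric controls
        "order": int(m["order"]),         # order of recruitment
    }


info = parse_speaker_code("FC02")
```

Because some speakers were left out of the release, the `order` values should not be expected to be consecutive.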
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
Canada
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Cerebral palsied
- Form subdivision:
Databases.
- General subdivision:
Language
- Geographic subdivision:
Canada
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Amyotrophic lateral sclerosis
- Form subdivision:
Databases.
- General subdivision:
Patients
- General subdivision:
Language
- Geographic subdivision:
Canada
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rudzicz, Frank
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hirst, Graeme
ADDED ENTRY--PERSONAL NAME
- Personal name:
van Lieshout, Pascal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Penn, Gerald
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shein, Fraser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Namasivayam, Aravind
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wolff, Talya
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636053
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 167-450-243-260-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Digital Archive of Southern Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Digital Archive of Southern Speech (DASS) was developed by the University of Georgia.
It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn
part of the Linguistic Atlas Project (LAP). DASS contains approximately 370 hours
of English speech data from 30 female speakers and 34 male speakers in .wav format
and in .mp3 format, along with associated metadata about the speakers and the recordings
and maps in .jpeg format relating to the recording locations. LAP consists of a set
of survey research projects about the words and pronunciation of everyday American
English, the largest project of its kind in the United States. Interviews with thousands
of native speakers across the country have been carried out since 1929. LAGS surveyed
the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas,
Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968-1983.
Interviews average approximately six hours in length; the systematic LAGS tape archive
amounts to 5500 hours of sound recordings. DASS is a collection of 64 interviews from
LAGS selected to cover a range of speech across the region and to represent multiple
education levels and ethnic backgrounds. This release is distributed on an external
hard drive and contains instructions for using the media and navigating to the LICHEN
program. Digital Archive of Southern Speech - NLP Version (LDC2016S05), an alternate
version suitable for natural language processing and human language technology applications,
is also available. *Data* The DASS speakers' average age is 61 years; there are 30 women
and 34 men from the Gulf States region represented in this release. The interviews
cover common topics such as family, the weather, household articles and activities,
agriculture and social connections. The interviews were originally recorded in the
field on reel-to-reel audio tape. A digital version of every reel of tape was then
made, one .wav file per reel, usually about one hour of sound. Each interview thus
consists of a set of 3 to 13 reels, or roughly 3 to 13 interview hours. Personally
identifying or sensitive information in the files was replaced with a tone to protect
the privacy of speakers and to assure their ethical treatment. Each .wav file is split into
multiple .mp3 files based on the topic of conversation and labeled accordingly. Included
spreadsheets provide information about the speakers, the labels used for topics and
the sound files. Also included in this release is a version of the LICHEN software
developed at the University of Oulu, Finland. LICHEN allows users to browse and search
through the audio data in a more advanced fashion using a graphical interface. Further
information and instructions for LICHEN can be found within the docs folder of this
release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
- Geographic subdivision:
Southern States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Discourse analysis.
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kretzschmar, William A., Jr.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bounds, Paulina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hettel, Jacqueline
ADDED ENTRY--PERSONAL NAME
- Personal name:
Coats, Steven
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pederson, Lee
ADDED ENTRY--PERSONAL NAME
- Personal name:
Opas-Hänninen, Lisa Lena
ADDED ENTRY--PERSONAL NAME
- Personal name:
Juuso, Ilkka
ADDED ENTRY--PERSONAL NAME
- Personal name:
Seppänen, Tapio
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u dra d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636061
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 841-757-472-203-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
dra
- Language code of text/sound track or separate title:
dra
- Language code of text/sound track or separate title:
dra
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
mkb
- Language code of text/sound track or separate title:
mjt
- Language code of text/sound track or separate title:
kmj
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Malto Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Malto Speech and Transcripts was developed by Masato Kobayashi, Associate Professor
in Linguistics at the University of Tokyo (Japan), and Bablu Tirkey, research scholar
at the Tribal and Regional Languages Department, Ranchi University (India). It contains
approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27
speakers (22 males, 5 females). Also included are accompanying transcripts, English
translations and glosses for 6 hours of the collection. Speakers were asked to talk
about themselves, their lives, rituals and folklore; elicitation interviews were then
conducted. The goal of the work was to present the current state and dialectal variation
of Malto. Malto is a Dravidian language spoken in northeastern India (principally
the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people called the
Pahariyas. Indian census data places the number of Malto speakers between 100,000
and 200,000. Most Malto speakers live in the three northeastern
districts of Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted
in this corpus was conducted in those districts. Of the Pahariyas in that area, three
subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the Kumarbhag Pahariyas, primarily
speak Malto. (Kobayashi 3) Pahariya villages or hamlets are located on hilly tracts
and, in the lowlands, are often separated by non-Pahariya villages. As a result, Malto
varies from village to village. It may be more accurate to consider Malto a continuum
of dialects rather than a unitary language. The three major dialects -- Sawriya Pahariya,
Mal Pahariya, and Kumarbhag Pahariya -- correspond to the principal sub-tribal communities.
(Kobayashi 14) For further reading on Malto, consult Texts and Grammar of Malto (2012)
by Masato Kobayashi published by Kotoba Books, Vizianagaram, India and sold by the
book distributors: Mary Martin Booksellers, 123 Third Street, Tatabad, Coimbatore
641012, India. They can be contacted at info@marymartin.com or at books.kotoba@gmail.com.
*Data* The transcribed data accounts for 6 hours of the collection and contains 21
speakers (17 male, 4 female). The untranscribed data accounts for 2 hours of the collection
and contains 10 speakers (9 male, 1 female). Four of the male speakers are present
in both groups. All audio is presented in .wav format. Each audio file name includes
a subject number, village name, speaker name and the topic discussed. The transcripts
and glossary are UTF-8 text files. Because of ambiguities that occur when writing
Malto in Devanagari script, the transcripts were developed using Roman script with
symbols adapted from the International Phonetic Alphabet (IPA) but are not considered
to be phonetic transcripts. Consult readme.txt and untran_speaker.txt for further
information about the corpus, its collection and the speakers. The transcription and
glosses are split into three text files; consult the readme to determine which audio
files are covered by each transcript. *Sample* A sample from this corpus is available
on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mal Paharia, Sauria Paharia, and Kumarbhag Paharia. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Malto language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Malto language
- General subdivision:
Discourse analysis.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kobayashi, Masato
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tirkey, Bablu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 285-048-567-623-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Translation Treebank: An-Nahar Newswire
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Translation Treebank: An Nahar Newswire was developed by the Linguistic Data
Consortium (LDC). It consists of 599 distinct newswire stories from the Lebanese publication
An Nahar translated from Arabic to English and annotated for part-of-speech and syntactic
structure. This corpus is part of an ongoing effort at LDC to produce parallel Arabic
and English treebanks. The files in this release are parallel with those in Arabic
Treebank: Part 3 - v3.2 (LDC2010T08). Other parallel releases are (1) Arabic Treebank:
Part 1 v 2.0 (LDC2003T06) (Agence France Presse newswire) and Arabic Treebank: Part
1 - 10K-word English Translation (LDC2003T07) and (2) Arabic Treebank: Part 1 v 3.0
(POS with full vocalization + syntactic analysis) (LDC2005T02) (Agence France Presse
newswire) and its translated English counterpart English-Arabic Treebank v 1.0 (LDC2006T10).
The guidelines followed for both part-of-speech and syntactic annotation are Penn
Treebank II style, with changes in the tokenization of hyphenated words, part-of-speech
and tree changes necessitated by those tokenization changes and revisions to the syntactic
annotation to comply with the updated annotation guidelines (including the Treebank-PropBank
merge or Treebank IIa and treebank c changes). *Data* The data consists of 461,489
tokens in 599 individual files. The news stories in this release were published in
An Nahar in 2002. The English source files (translated from the Arabic) were automatically
tokenized, part-of-speech tagged and parsed; the tokens, tags and parses were manually
corrected. The quality control process consisted of a series of specific searches
for over 100 types of potential inconsistency and parse or annotation error. Any errors
found in those searches were manually corrected. The steps occurred in the following
order: (1) automatic tokenization, (2) human correction, (3) automatic pre-tag and pre-parse,
(4) human annotation, (5) QC correction and (6) automatic scripts for treebank c revisions.
Annotations are in the following two formats: (1) Penn Style Trees: bracketed tree files
following the basic form (NODE (TAG token)), with each sentence surrounded by a pair of
empty parentheses; (2) AG xml: TreeEditor .xml stand-off annotation files, which contain
the POS and Treebank annotation and reference the source files by character offset.
DTD files for the AG xml files were moved from their original location indicated in
the readme to be more consistent with LDC publications. *Sponsorship* This work was
sponsored in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or policy of the Government, and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mott, Justin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warner, Colin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636088
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 288-567-224-925-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
USC-SFI MALACH Interviews and Transcripts English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
USC-SFI MALACH Interviews and Transcripts English, LDC Catalog Number LDC2012S05 and
ISBN 1-58563-602-9, was developed by The University of Southern California Shoah Foundation
Institute (USC-SFI), the University of Maryland, IBM and Johns Hopkins University
as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains
approximately 375 hours of interviews from 784 interviewees along with transcripts
and other documentation. Inspired by his experience making Schindler's List, Steven
Spielberg established the Survivors of the Shoah Visual History Foundation in 1994
to gather video testimonies from survivors and other witnesses of the Holocaust. While
most of those who gave testimony were Jewish survivors, the Foundation also interviewed
homosexual survivors, Jehovah's Witness survivors, liberators and liberation witnesses,
political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors,
survivors of eugenics policies, and war crimes trials participants. Within several
years, the Foundation's Visual History Archive held nearly 52,000 video testimonies
in 32 languages representing 56 countries. It is the largest archive of its kind in
the world. In 2006, the Foundation became part of the Dana and David Dornsife College
of Letters, Arts and Sciences at the University of Southern California in Los Angeles
and was renamed as the USC Shoah Foundation Institute for Visual History and Education.
The goal of the MALACH project was to develop methods for improved access to large
multinational spoken archives. The focus was advancing the state of the art of automatic
speech recognition (ASR) and information retrieval. The characteristics of the USC-SFI
collection -- unconstrained, natural speech filled with disfluencies, heavy accents,
age-related coarticulations, un-cued speaker and language switching and emotional
speech -- were considered well-suited for that task. The work centered on five languages:
English, Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts
English was developed for the English speech recognition experiments. *Data* The speech
data in this release was collected beginning in 1994 under a wide variety of conditions
ranging from quiet to noisy (e.g., airplane overflights, wind noise, background conversations
and highway noise). Original interviews were recorded on Sony Beta SP tapes, then
digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound
files in this release are compressed in MP3 format at a sampling frequency of 44.1
kHz. Approximately 25,000 of all USC-SFI collected interviews are in English and average
approximately 2.5 hours each. The 784 interviews included in this release are each
a 30-minute section of the corresponding larger interview. Due to the way the original
interviews were arranged on the tapes, some interviews are clipped and have a duration
of less than 30 minutes. Certain interviews include speech from family members in
addition to that of the subject and the interviewer. Accordingly, the corpus contains
speech from more than 784 speakers, who are more or less equally distributed between
males and females. The interviews also include accented speech over a wide range (e.g.,
Hungarian, Italian, Yiddish, German and Polish). This release includes transcripts
in .trs format of the first 15 minutes of each interview. The transcripts were created
using Transcriber 1.5.1 and later modified.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Holocaust, Jewish (1939-1945)
- Form subdivision:
Personal narratives
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Holocaust survivors
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Sociolinguistics.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Oral communication
- General subdivision:
Archival resources.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramabhadran, Bhuvana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gustman, Samuel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Byrne, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oard, Douglas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Olsson, J. Scott
ADDED ENTRY--PERSONAL NAME
- Personal name:
Picheny, Michael
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636096
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 398-730-015-144-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News was developed
by researchers at the Department of Computer Science and Engineering, University of
South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National
Institute of Standards and Technology (NIST). It contains approximately 60 hours of
English broadcast news video data collected by LDC in 1998 and annotated for the 2005
VACE (Video Analysis and Content Extraction) tasks. The tasks covered by the broadcast
news domain were human face (FDT) tracking, text strings (TDT) (glyphs rendered within
the video image for the text object detection and tracking task) and word level text
strings (TDT_Word_Level) (videotext OCR task). The VACE program was established to
develop novel algorithms for automatic video content extraction, multi-modal fusion,
and event understanding. During VACE Phases I and II, the program made significant
progress in the automated detection and tracking of moving objects including faces,
hands, people, vehicles and text in four primary video domains: broadcast news, meetings,
street surveillance, and unmanned aerial vehicle motion imagery. Initial results were
also obtained on automatic analysis of human activities and understanding of video
sequences. Three performance evaluations were conducted under the auspices of the
VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in
collaboration with NIST and guided by an advisory forum including the evaluation participants.
LDC has previously released: * NIST/USF Evaluation Resources for the VACE Program
-- Meeting Data Training Set Part 1 LDC2011V01 * NIST/USF Evaluation Resources for
the VACE Program -- Meeting Data Training Set Part 2 LDC2011V02 * NIST/USF Evaluation
Resources for the VACE Program -- Meeting Data Test Set Part 1 LDC2011V03 * NIST/USF
Evaluation Resources for the VACE Program -- Meeting Data Test Set Part 2 LDC2011V04
* 2006 NIST/USF Evaluation Resources for the VACE Program -- Meeting Data Test Set
Part 1 LDC2011V05 * 2006 NIST/USF Evaluation Resources for the VACE Program -- Meeting
Data Test Set Part 2 LDC2011V06 This release is distributed on an external hard drive;
consult LDC's instructions for help attaching and reading the drive. *Data* The broadcast
news recordings were collected by LDC in 1998 from the following sources: CNN Headline
News (CNN-HDL) and ABC World News Tonight (ABC-WNT). CNN-HDL is a 24-hour/day cable-TV
broadcast which presents top news stories continuously throughout the day. ABC-WNT
is a daily 30-minute news broadcast that typically covers about a dozen different
news items. Each daily ABC-WNT broadcast and up to four 30-minute sections of CNN-HDL
were recorded each day. The CNN segments were drawn from that portion of the daily
schedule that happened to include closed captioning. The broadcasts were captured
directly from a cable TV connection. The signal first went to a VCR which was programmed
to record the broadcast on VHS tape. These were later recorded in MPEG-2 (29.97 fps
at 720 x 480p) for use in the VACE program. The MPEG-2 versions are contained in this
release. *Tools* The VACE evaluation tools have been integrated into NIST's downloadable
Framework for Detection Evaluation (F4DE) Toolkit. The toolkit contains small example
files for each of the task/object/domain scoring combinations.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Human face recognition (Computer science)
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Content-based image retrieval
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Digital video
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kasturi, Rangachar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goldgof, Dmitry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Manohar, Vasant
ADDED ENTRY--PERSONAL NAME
- Personal name:
Soundararajan, Padmanabhan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garofolo, John S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bowers, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rose, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Michel, Martial
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 817-755-702-726-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2009 CoNLL Shared Task Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2009 CoNLL Shared Task Part 1, LDC Catalog Number LDC2012T03 and ISBN 1-58563-610-X,
contains the Catalan, Czech, German and Spanish trial corpora, training corpora, development
and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning)
Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations,
including semantic dependencies that model the roles of both verbal and nominal predicates.
The Conference on Computational Natural Language Learning (CoNLL) is accompanied every
year by a shared task intended to promote natural language processing applications
and evaluate them in a standard setting. The 2004 and 2005 CoNLL shared tasks were
dedicated to semantic role labeling (SRL) in a monolingual setting (English). In 2006
and 2007, the shared tasks were devoted to the parsing of syntactic dependencies and
used corpora from up to thirteen languages. In 2008, the shared task focused on English
and employed a unified dependency-based formalism and merged the task of syntactic
dependency parsing and the task of identifying semantic arguments and labeling them
with semantic roles; that data has been released by LDC as 2008 CoNLL Shared Task Data.
The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese,
Czech, German, Japanese and Spanish). Among the new features were comparison of time
and space complexity based on participants' input, and learning curve comparison for
languages with large datasets. The 2009 shared task was divided into two subtasks:
(1) parsing syntactic dependencies and (2) identification of arguments and assignment of semantic
roles for each predicate. 2009 CoNLL Shared Task Part 2 (LDC2012T04) contains the English
and Chinese task data and is also available through LDC. LDC has also released the
following CoNLL Shared Task data sets: * 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12) * 2008 CoNLL Shared Task Data
(LDC2009T12) * 2015-2016 CoNLL Shared Task (LDC2017T13) *Data* The materials in this
release consist of excerpts from the following corpora: * Ancora (Spanish + Catalan):
500,000 words each of annotated news text developed by the University of Barcelona,
Polytechnic University of Catalonia, the University of Alacante and the University
of the Basque Country * Prague Dependency Treebank 2.0 (Czech): approximately 2 million
words of annotated news, journal and magazine text developed by Charles University
also available through LDC, LDC2006T01 * TIGER Treebank + SALSA Corpus (German): approximately
900,000 words of annotated news text and FrameNet annotation developed by the University
of Potsdam, Saarland University and the University of Stuttgart In addition, an archive
of all of the uploaded data from the participants is included in the eval-data folder.
Users should note that not all data indicated in the individual READMEs is included
in this release and neither are some of the corresponding DTDs for of the XML. Additionally,
all data is presented in its uncompressed form for ease of use. Within the user eval-data
folder, the two folders marked bad contain references to data from languages included
in Part 2 of this release as well as to Japanese data. Japanese data is not included
in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish, German, Czech, and Catalan. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martí, Maria Antonia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marquez, Lluis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nivre, Joakim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Štěpánek, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Padó, Sebastian
ADDED ENTRY--PERSONAL NAME
- Personal name:
Straňák, Pavel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636118
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 088-658-711-565-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2009 CoNLL Shared Task Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2009 CoNLL Shared Task Part 2, LDC Catalog Number LDC2012T04 and ISBN 1-58563-611-8,
contains the Chinese and English trial corpora, training corpora, development and
test data for the 2009 CoNLL (Conference on Computational Natural Language Learning)
Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations,
including the semantic dependencies model roles of both verbal and nominal predicates.
The Conference on Computational Natural Language Learning (CoNLL) is accompanied every
year by a shared task intended to promote natural language processing applications
and evaluate them in a standard setting. The 2004 and 2005 CoNLL shared tasks were
dedicated to semantic role labeling (SRL) in a monolingual setting (English). In 2006
and 2007, the shared tasks were devoted to the parsing of syntactic dependencies and
used corpora from up to thirteen languages. In 2008, the shared task focused on English
and employed a unified dependency-based formalism that merged the task of syntactic
dependency parsing with the task of identifying semantic arguments and labeling them
with semantic roles. That data has been released by LDC as 2008 CoNLL Shared Task Data.
The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese,
Czech, German, Japanese and Spanish). Among the new features were comparison of time
and space complexity based on participants' input, and learning curve comparison for
languages with large datasets. The 2009 shared task was divided into two subtasks:
* parsing syntactic dependencies
* identification of arguments and assignment of semantic roles for each predicate
2009 CoNLL Shared Task Part 1 (LDC2012T03) contains the Catalan, Czech, German and Spanish
task data and is also available through LDC. LDC has also released the following CoNLL
Shared Task data sets:
* 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
* 2008 CoNLL Shared Task Data (LDC2009T12)
* 2015-2016 CoNLL Shared Task (LDC2017T13)
*Data* The materials in this release consist of excerpts from the following corpora:
* Penn Treebank II (LDC95T7) (English): over one million words of annotated English
newswire and other text developed by the University of Pennsylvania
* PropBank (LDC2004T14) (English): semantic annotation of newswire text from Treebank-2
developed by the University of Pennsylvania
* NomBank (LDC2008T23) (English): argument structure for instances of common nouns in
Treebank-2 and Treebank-3 (LDC99T42) texts developed by New York University
* Chinese Treebank 6.0 (LDC2007T36) (Chinese): 780,000 words (over 1.28 million characters)
of annotated Chinese newswire, magazine and administrative texts and transcripts from
various broadcast news programs developed by the University of Pennsylvania and the
University of Colorado
* Chinese Proposition Bank 2.0 (LDC2008T07) (Chinese): predicate-argument annotation
on 500,000 words from Chinese Treebank 6.0 developed by the University of Pennsylvania
and the University of Colorado
In addition,
an archive of all of the uploaded data from the participants is included in the eval-data
folder. Users should note that not all data indicated in the individual READMEs is
included in this release and neither are some of the corresponding DTDs for the XML.
Additionally, all data is presented in its uncompressed form for ease of use. Within
the user eval-data folder, the two folders marked bad contain references to data from
languages included in Part 1 of this release as well as to Japanese data. Japanese
data is not included in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ciaramita, Massimiliano
ADDED ENTRY--PERSONAL NAME
- Personal name:
Johansson, Richard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meyers, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Štěpánek, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nivre, Joakim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Straňák, Pavel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Surdeanu, Mihai
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen (Bert)
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636126
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 475-765-099-443-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Dependency Treebank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's
Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains
49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire
stories published between 1992 and 1996 and annotated with syntactic dependency structures.
*Data* Ill-formed or short sentences were eliminated from the randomly selected sentences
prior to annotation. The data was segmented and annotated for part of speech (POS),
syntactic structures, verb subclasses and noun compounds. Word segmentation and POS
tagging were accomplished automatically using statistical models trained on a larger,
annotated corpus of People's Daily newswire stories. Humans manually annotated the
syntactic structures and corrected word segmentation errors. POS tags were not corrected.
The data is provided in CoNLL-X format and encoded in UTF-8. Each line presents information
for one word, and an empty line marks the end of a sentence. Each line contains 10
tab-separated columns.
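The layout described above (one word per line, 10 tab-separated columns, a blank line between sentences, UTF-8) can be read with a short script. The following is a minimal sketch, not code shipped with the corpus: the column names are the standard CoNLL-X ones and are assumed to apply to this release, the function names are hypothetical, and the two-word sample with its tag and relation labels is invented purely for illustration.

```python
import io
from typing import Dict, Iterator, List, TextIO, Tuple

# Standard CoNLL-X column names; assumed (not confirmed by this record) to match the release.
CONLLX_FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                 "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_conllx(stream: TextIO) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of column-name -> value dicts."""
    sentence: List[Token] = []
    for raw in stream:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, cols)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def dependency_arcs(sentence: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (head index, dependent index, relation) triples; head 0 is the root."""
    return [(int(tok["HEAD"]), int(tok["ID"]), tok["DEPREL"]) for tok in sentence]


# Invented sample; the tags and relation labels are placeholders, not corpus data.
SAMPLE = (
    "1\t我\t_\tr\tr\t_\t2\tSBV\t_\t_\n"
    "2\t来\t_\tv\tv\t_\t0\tHED\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllx(io.StringIO(SAMPLE)):
        print(dependency_arcs(sent))
```

To read a file from the release rather than the inline sample, open it with `open(path, encoding="utf-8")` and pass the file object to `read_conllx`.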
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Che, Wanxiang
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Zhenghua
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636134
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 046-763-102-667-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC.
Along with other corpora, the parallel text in this release comprised machine translation
training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and corresponding
English translations selected from broadcast conversation (BC) data collected by LDC
between 2004 and 2007 and transcribed by LDC or under its direction. LDC has released
the following GALE Phase 1 & 2 Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Data* GALE Phase 2 Arabic Broadcast
Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising
169,109 words of Arabic source text and its English translation. Data is drawn from
thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following
sources: Al Alam News Channel, a broadcaster located in Iran; Aljazeera, a regional
broadcast programmer based in Doha, Qatar; Dubai TV, located in Dubai, United Arab
Emirates; Oman TV, a national broadcaster located in the Sultanate of Oman; and Radio
Sawa, a U.S. government-funded regional broadcaster. Broadcast conversation programming
is generally more interactive than traditional news broadcasts and includes talk shows,
interviews, call-in programs and roundtable discussions. The programs in this release
focus on current events topics. The files in this release were transcribed by LDC
staff and/or transcription vendors under contract to LDC in accordance with Quick
Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries
in addition to transcribing the text. Data was manually selected for translation according
to several criteria, including linguistic features, transcription features and topic
features. The transcribed and segmented files were then reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed LDC's
Arabic to English translation guidelines, which are included with this release. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited text
files containing one segment of text along with meta information about that segment.
Each field in the TDF file is described in TDF_format.txt. All data are encoded in
UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636142
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 831-432-792-126-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Turkish Broadcast News Speech and Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University,
Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish
radio broadcasts and corresponding transcripts. This is part of a larger corpus of
Turkish broadcast news data collected and transcribed with the goal of facilitating
research in Turkish automatic speech recognition and its applications, such as speech
retrieval. The VOA material was collected between December 2006 and June 2009 using
a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded
from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions.
A quick manual segmentation and transcription approach was followed. Speech recognition
and retrieval experiments using the larger corpus can be found in the following journal
article: Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak, and Murat Saraclar, Turkish
Broadcast News Transcription and Retrieval, IEEE Transactions
on Audio, Speech and Language Processing, 17(5):874-883, July 2009. For more information
please visit http://busim.ee.boun.edu.tr/~speech or contact the principal investigator,
Murat Saraçlar. *Data* The data was recorded at 32 kHz and resampled to 16 kHz. After
screening for recording quality, the files were segmented, transcribed, and verified.
The segmentation occurred in two steps, an initial automatic segmentation followed
by manual correction and annotation which included information such as background
conditions and speaker boundaries. The transcription guidelines were adapted from
the LDC HUB4 and quick transcription guidelines. An English version of the adapted
guidelines is provided with the data here. The manual segmentations and transcripts
were created by native Turkish speakers at Boğaziçi University using Transcriber.
The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
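Because the transcripts use ISO-8859-9 rather than UTF-8, tools that assume UTF-8 may misread the Turkish letters. The following is a minimal sketch of converting a transcript to UTF-8 with Python's built-in `iso-8859-9` codec. The function names and file paths are hypothetical, and nothing here is taken from the corpus documentation.

```python
from pathlib import Path


def latin5_bytes_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-9 (Latin5) bytes and re-encode them as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


def latin5_file_to_utf8(src: Path, dst: Path) -> None:
    """Convert one file, working on raw bytes so line endings are left untouched."""
    dst.write_bytes(latin5_bytes_to_utf8(src.read_bytes()))


if __name__ == "__main__":
    # Round-trip check on an in-memory string rather than on corpus data.
    original = "Boğaziçi Üniversitesi"
    converted = latin5_bytes_to_utf8(original.encode("iso-8859-9"))
    print(converted.decode("utf-8"))
```

Working on bytes rather than text mode avoids platform-specific newline translation, so the converted file differs from the source only in its character encoding.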
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Turkish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Turkish language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Saraçlar, Murat
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636150
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 009-748-438-749-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank - Broadcast News v1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank - Broadcast News v1.0 was developed at the Linguistic Data Consortium
(LDC). It consists of 120 transcribed Arabic broadcast news stories with part-of-speech,
morphology, gloss and syntactic tree annotation. The ongoing Penn Arabic Treebank (PATB) project supports
research in Arabic-language natural language processing and human language technology
development. The methodology and work leading to the release of this publication are
described in detail in the documentation accompanying this corpus. *Data* This release
contains 432,976 source tokens before clitics were split, and 517,080 tree tokens
after clitics were separated for treebank annotation. The source materials are Arabic
broadcast news stories collected by LDC during the period 2005-2008 from the following
sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha,
Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV,
Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV. The transcripts
were produced by LDC.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Broadcast journalism
- Form subdivision:
Databases.
- Geographic subdivision:
Arabian Peninsula
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tabassi, Dalila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ciul, Michael
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636169
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 443-974-834-414-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Prague Czech-English Dependency Treebank 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute
of Formal and Applied Linguistics at Charles University in Prague, Czech Republic.
It is a corpus of Czech-English parallel resources translated, aligned and manually
annotated for dependency structure, semantic labeling, argument structure, ellipsis
and anaphora resolution. This release updates Prague Czech-English Dependency Treebank
1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two
million words in close to 100,000 sentences. *Data* The principal new material in
PCEDT 2.0 is the inclusion of the entire Wall Street Journal data from Treebank-3
(LDC99T42). Not included from PCEDT 1.0 are the Reader's Digest material, the Czech
monolingual corpus, and the English-Czech dictionary. Each section is enhanced with
a comprehensive manual linguistic annotation in the Prague Dependency Treebank style
(LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation
style are:
* dependency structure of the content words and coordinating and similar structures
(function words are attached as their attribute values)
* semantic labeling of content words and types of coordinating structures
* argument structure, including an argument structure (valency) lexicon for both languages
* ellipsis and anaphora resolution
This annotation style is called tectogrammatical annotation, and it constitutes
the tectogrammatical layer in the corpus. Please consult the PCEDT website for more
information and documentation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Czech. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Czech language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajičová, Eva
ADDED ENTRY--PERSONAL NAME
- Personal name:
Panevová, Jarmila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sgall, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cinková, Silvie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fučíková, Eva
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mikulová, Marie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pajas, Petr
ADDED ENTRY--PERSONAL NAME
- Personal name:
Popelka, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Semecký, Jiří
ADDED ENTRY--PERSONAL NAME
- Personal name:
Šindlerová, Jana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Štěpánek, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Toman, Josef
ADDED ENTRY--PERSONAL NAME
- Personal name:
Urešová, Zdeňka
ADDED ENTRY--PERSONAL NAME
- Personal name:
Žabokrtský, Zdeněk
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636177
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 833-244-299-656-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
sem
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
ajp
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic-Dialect/English Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies (BBN),
LDC and Sakhr Software and contains approximately 3.5 million tokens of Arabic dialect
sentences and their English translations. *Data* The data in this corpus consists
of Arabic web text as follows: 1. Filtered automatically from large Arabic text corpora
harvested from the web by LDC. The LDC corpora consisted largely of weblog and online
user groups and amounted to around 350 million Arabic words. Documents that contained
a large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated.
A list of dialect words was manually selected by culling through the Levantine Fisher
(LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME speech corpora
(LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) distributed by LDC. That list was
then used to retain documents that contained a certain number of matches. The resulting
subset of the web corpora contained around four million words. Documents were automatically
segmented into passages using formatting information from the raw data. 2. Manually
harvested by Sakhr Software from Arabic dialect web sites. Dialect classification
and sentence segmentation, as needed, and translation into English were performed
by BBN through Amazon's Mechanical Turk. Arabic annotators from Mechanical Turk classified
filtered passages as being either MSA or one of four regional dialects: Egyptian,
Levantine, Gulf/Iraqi or Maghrebi. An additional General dialect option was allowed
for ambiguous passages. The classification was applied to whole passages rather than
individual sentences. Only the passages labeled Levantine and Egyptian were further
processed. The segmented Levantine and Egyptian sentences were then translated. Annotators
were instructed to translate completely and accurately and to transliterate Arabic
names. They were also provided with examples. All segments of a passage were presented
in the same translation task to provide context.
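The word-list filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure BBN or LDC actually ran: the function name, the whitespace tokenization, the placeholder tokens and the match threshold of 3 are all assumptions, since the summary only says that documents with "a certain number of matches" were retained.

```python
def keep_dialect_documents(documents, dialect_words, min_matches=3):
    """Return the documents containing at least `min_matches` dialect tokens.

    `min_matches` is a hypothetical threshold; the source does not state
    the number that was actually used.
    """
    dialect_words = set(dialect_words)
    kept = []
    for doc in documents:
        # Naive whitespace tokenization (an assumption for illustration).
        matches = sum(1 for token in doc.split() if token in dialect_words)
        if matches >= min_matches:
            kept.append(doc)
    return kept


# Placeholder tokens stand in for real dialect words.
docs = ["w1 w2 w3 x", "x y z", "w1 x w1 w2"]
print(keep_dialect_documents(docs, {"w1", "w2", "w3"}))
```

Counting token matches rather than distinct word types is one possible reading of "matches"; either choice fits the description in the summary.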
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Egyptian Arabic, North Levantine Arabic, and South Levantine Arabic.
Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Dialects
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
- General subdivision:
Dialects
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Raytheon BBN Technologies
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sakhr Software
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u cat d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636185
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 442-580-062-511-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Catalan TimeBank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists
of Catalan texts in the AnCora corpus annotated with temporal and event information
according to the TimeML specification language. TimeML (Pustejovsky, et al., 2005)
is a schema for annotating eventualities and time expressions in natural language
as well as the temporal relations among them, thus facilitating the task of extraction,
representation and exchange of temporal information. Catalan TimeBank 1.0 is annotated
in three levels, marking events, time expressions and event metadata. The TimeML annotation
scheme was tailored for the specifics of the Catalan language. Temporal relations
in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional,
etc.) and grammatical aspect (e.g., imperfective) which are absent in English. Catalan
TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages
such as English, Spanish, Italian, French, Korean and Chinese. Through their common
layer of annotation, these corpora provide resources useful for multilingual temporal
extraction and processing, such as multilingual text entailment, opinion mining or
question answering. LDC has released the following corpora incorporating TimeBank
annotation: TimeBank 1.2 LDC2006T08, FactBank 1.0 LDC2009T23 and ModeS TimeBank 1.0
LDC2012T01. *Data* Catalan TimeBank 1.0 contains stand-off annotations for 210 documents
with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding
punctuation). The source documents are from the EFE news agency, the ACN Catalan news
agency and the Catalan version of the El Periódico newspaper, and span the period
from January to December 2000. The AnCora corpus is the largest multilayer annotated
corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000
words in Catalan. The AnCora documents are annotated on many linguistic levels including
structure, syntax, dependencies, semantics and pragmatics. That information is not included
in this release, but it can be mapped to the present annotations. The data contained
in the AnCora corpus has been used in several international natural language processing
evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely
available from the Centre de Llenguatge i Computació (CLiC).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Catalan. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Catalan language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Grammar, Comparative and general
- Form subdivision:
Databases.
- General subdivision:
Temporal clauses
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Grammar, Comparative and general
- Form subdivision:
Databases.
- General subdivision:
Temporal constructions
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Badia, Toni
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636193
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 756-362-661-905-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
American English Nickname Collection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
American English Nickname Collection was developed by Intelius, Inc. and is a compilation
of American English nicknames to given name mappings based on information in US government
records, public web profiles and financial and property reports. This corpus is intended
as a tool for the quantitative study of nickname usage in the United States such as
in demographic and sociological studies. It has multiple potential human language
technology applications as well, including entity extraction, coreference resolution,
people search, language modeling and machine translation. *Data* The American English
Nickname Collection contains 331,237 distinct mappings encompassing millions of names.
The data was collected and processed through a record linkage pipeline. The steps
in the pipeline were (1) data cleaning, (2) blocking, (3) pair-wise linkage and (4)
clustering. In the cleaning step, material was categorized, processed to remove junk
and spam records and normalized to an approximately common representation. The blocking
process utilized an algorithm to group records by shared properties for determining
which record pairs should be examined by the pairwise linker as potential duplicates.
The linkage step assigned a score to record pairs using a supervised pairwise-based
machine learning model. The clustering step combined record pairs into connected components
and further partitioned each connected component to remove inconsistent pairwise links.
The result is that input records were partitioned into disjoint sets called profiles,
where each profile corresponded to a single person. The material is presented in the
form of a comma delimited text file. Each line contains a first name, a nickname or
alias, its conditional probability and its frequency. The conditional probability
for each nickname is derived from the base data using an algorithm which calculates
both the probability for which any alias refers to a given name and a threshold below
which the mapping is most likely an error. This threshold eliminates typographic errors
and other noise from the data.
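The clustering step described above, which combines scored record pairs into connected components, can be sketched with a small union-find structure. This is only an illustration under stated assumptions: the record ids, the scores and the 0.5 cutoff are hypothetical, and the further partitioning that removes inconsistent links is not reproduced because the summary does not say how it was done.

```python
from collections import defaultdict


def connected_components(pairs, threshold=0.5):
    """Group record ids into connected components from scored pairs.

    `pairs` is an iterable of (record_a, record_b, score). The 0.5
    threshold is an illustrative assumption, not a documented value.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both ids
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # link the two components

    groups = defaultdict(set)
    for record in list(parent):
        groups[find(record)].add(record)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical record ids and linkage scores.
scored = [("r1", "r2", 0.9), ("r2", "r3", 0.8), ("r4", "r5", 0.2)]
print(connected_components(scored))
```

Each resulting component corresponds to a candidate profile in the sense used above; pairs scoring below the cutoff leave their records in separate components.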
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Nicknames
- Form subdivision:
Databases.
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Names, English
- Form subdivision:
Databases.
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Carvalho, Vitor R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kiran, Yigit
ADDED ENTRY--PERSONAL NAME
- Personal name:
Borthwick, Andrew
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636207
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 422-097-648-917-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Spanish TimeBank 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists
of Spanish texts in the AnCora corpus annotated with temporal and event information
according to the TimeML specification language. TimeML (Pustejovsky, et al., 2005)
is a schema for annotating eventualities and time expressions in natural language
as well as the temporal relations among them, thus facilitating the task of extraction,
representation and exchange of temporal information. Spanish TimeBank 1.0 is annotated
in three levels, marking events, time expressions and event metadata. The TimeML annotation
scheme was tailored for the specifics of the Spanish language. Temporal relations
in Spanish present distinctions of verbal mood (e.g., indicative, subjunctive, conditional,
etc.) and grammatical aspect (e.g., imperfective) which are absent in English. Spanish
TimeBank 1.0 joins the family of TimeBank annotated corpora which includes languages
such as English, Italian, French, Korean and Chinese. Through their common layer of
annotation, these corpora provide resources useful for multilingual temporal extraction
and processing, such as multilingual text entailment, opinion mining or question answering.
Spanish TimeBank 1.0 is the Spanish language complement to Catalan TimeBank 1.0 (LDC2012T10).
LDC has released other corpora incorporating TimeBank annotation: TimeBank 1.2 LDC2006T08,
FactBank 1.0 LDC2009T23 and ModeS TimeBank 1.0 LDC2012T01. *Data* Spanish TimeBank
1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including
punctuation marks) and 68,000 tokens (excluding punctuation). The source documents
are news stories and fiction from the AnCora corpus. The AnCora corpus is the largest
multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words
in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many
linguistic levels including structure, syntax, dependencies, semantics and pragmatics.
That information is not included in this release, but it can be mapped to the present
annotations. The data contained in the AnCora corpus has been used in several international
natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007.
The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Spanish language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Badia, Toni
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636215
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 230-396-178-102-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Web Treebank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding
through a gift from Google Inc. It consists of over 250,000 words of English weblogs,
newsgroups, email, reviews and question-answers manually annotated for syntactic structure
and is designed to allow language technology researchers to develop and evaluate the
robustness of parsing methods in those web domains. *Data* This release contains 254,830
word-level tokens and 16,624 sentence-level tokens of webtext in 1174 files annotated
for sentence- and word-level tokenization, part-of-speech, and syntactic structure.
The data is roughly evenly divided across five genres: weblogs, newsgroups, email,
reviews, and question-answers. The files were manually annotated following the sentence-level
tokenization guidelines for web text and the word-level tokenization guidelines developed
for English treebanks in the DARPA GALE project. Only text from the subject line and
message body of posts, articles, messages and question-answers was collected and
annotated. Weblogs are interactive web sites that display content as discrete entries
or posts and allow viewers to comment on entries and engage in discussions. They are
typically managed by individuals and use informal or colloquial language. The weblog
data in this release was collected by LDC and covers the period 2003-2006. Newsgroups
are repositories of online discussions pertaining to a topic or interest area. They
consist of threads that in turn contain articles with comments and discussion from
group users. The newsgroup data in this release was collected by LDC and covers the
period 2003-2006. Emails are messages sent to discrete individuals or well-defined
groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in
this corpus are a subset of emails sent by Enron Corporation employees during the
period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus,
a collection of 96,107 email messages from the sent folders of Enron email users which
were processed to remove any content not generated by human users. The reviews in
this corpus were gleaned from online reviews of businesses and services on various
Google web sites written by individuals. This information was provided to LDC by Google
in 2011; the dates of individual reviews are not available. Question-answers are posts
from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals
submit and answer questions which may be on any topic. This data was collected in
2011; the dates of individual question-answers were not collected.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mott, Justin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warner, Colin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636223
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 458-291-139-032-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 was developed by the
Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source text and
corresponding English translations selected from broadcast conversation (BC) data
collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.
LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text data sets: *
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24) * GALE Phase
1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09) * GALE Phase 1 Arabic
Blog Parallel Text (LDC2008T02) * GALE Phase 1 Arabic Newsgroup Parallel Text - Part
1 (LDC2009T03) * GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06) * GALE
Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14) * GALE Phase
2 Arabic Newswire Parallel Text (LDC2012T17) * GALE Phase 2 Arabic Broadcast News
Parallel Text (LDC2012T18) * GALE Phase 2 Arabic Web Parallel Text (LDC2013T01) *Data*
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 includes 29 source-translation
document pairs, comprising 169,488 words of Arabic source text and its English translation.
Data is drawn from eight distinct Arabic programs broadcast between 2004 and 2007
from Aljazeera, a regional broadcast programmer based in Doha, Qatar and Nile TV,
an Egyptian broadcaster. Broadcast conversation programming is generally more interactive
than traditional news broadcasts and includes talk shows, interviews, call-in programs
and roundtables. The programs in this release focus on current events topics. The
files in this release were transcribed by LDC staff and/or transcription vendors under
contract to LDC in accordance with the Quick Rich Transcription guidelines developed
by LDC. Transcribers indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to several criteria, including
linguistic features, transcription features and topic features. The transcribed and
segmented files were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Arabic to English translation
guidelines. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8. *Sponsorship* This work was supported in part by the Defense
Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.
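The summary above describes TDF files as tab-delimited, UTF-8 encoded, with one segment of text per line plus metadata fields. A minimal Python sketch of reading such a file follows. The column order and meaning are defined in TDF_format.text and are not reproduced here, so the sample rows, their field values and the helper name read_tdf_rows are hypothetical and for illustration only.

```python
import csv
import io
from typing import Iterable, Iterator, List


def read_tdf_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-empty tab-delimited line as a list of field strings."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical two-line sample; real column layout is given in TDF_format.text.
sample = "file_a\t0\t1.5\tspeaker1\tمرحبا\nfile_a\t0\t3.0\tspeaker2\tHello\n"
rows = list(read_tdf_rows(io.StringIO(sample)))
```

For a file on disk, the same function could be given a handle opened with open(path, encoding="utf-8", newline="").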
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 582-702-931-027-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
was developed by the Linguistic Data Consortium (LDC) and contains 150,068 tokens
of word aligned Chinese and English parallel text enriched with linguistic tags. This
material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. *Data* This release consists of Chinese source
newswire and web data (newsgroup, weblog) collected by LDC in 2008. The distribution
by genre, words, character tokens and segments appears below:
Language  Genre  Files  Words   CharTokens  Segments
Chinese   nw     193    53279   79919       2016
Chinese   wb     87     46766   70149       2357
Total            280    100045  150068      4373
Note that all token counts are based on the Chinese
data only. One token is equivalent to one character and one word is equivalent to
1.5 characters. The Chinese word alignment tasks consisted of the following components:
* Identifying, aligning, and tagging 8 different types of links * Identifying, attaching,
and tagging local-level unmatched words * Identifying and tagging sentence/discourse-level
unmatched words * Identifying and tagging all instances of Chinese 的 (DE) except when
they were a part of a semantic link. The file names indicate the source provider,
the story date and the language. For example, AFP_CMN_20080406 refers to the source
Agence France Presse (AFP), the story date is April 6, 2008 and the language is Chinese
(CMN).
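The naming convention described above (source provider, language, story date) can be unpacked mechanically. The following Python sketch assumes the name is exactly three underscore-separated parts with a YYYYMMDD date, as in the AFP_CMN_20080406 example; real file names in the release may carry extensions or extra suffixes that this sketch does not handle. The names StoryName and parse_story_name are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple


class StoryName(NamedTuple):
    provider: str
    language: str
    story_date: date


def parse_story_name(name: str) -> StoryName:
    """Split a PROVIDER_LANG_YYYYMMDD name into its three parts."""
    # Raises ValueError if the name does not have exactly three parts
    # or if the last part is not a valid YYYYMMDD date.
    provider, language, stamp = name.split("_")
    story_date = datetime.strptime(stamp, "%Y%m%d").date()
    return StoryName(provider, language, story_date)


info = parse_story_name("AFP_CMN_20080406")
```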
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636746
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 766-428-479-143-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multi-Channel WSJ Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multi-Channel WSJ Audio (MCWSJ) was developed by the Centre for Speech Technology
Research at The University of Edinburgh and contains approximately 100 hours of recorded
speech from 45 British English speakers. Participants read Wall Street Journal texts
published in 1987-1989 in three recording scenarios: a single stationary speaker,
two stationary overlapping speakers and one single moving speaker. This corpus was
designed to address the challenges of speech recognition in meetings, which often
occur in rooms with non-ideal acoustic conditions and significant background noise,
and may contain large sections of overlapping speech. Using headset microphones represents
one approach, but meeting participants may be reluctant to wear them. Microphone arrays
are another option. MCWSJ supports research in large vocabulary tasks using microphone
arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read
News, a corpus originally developed for large vocabulary continuous speech recognition
experiments, which in turn was based on CSR-1 (WSJ0) Complete, made available by LDC
to support large vocabulary continuous speech recognition initiatives. *Data* Speakers
reading news text from prompts were recorded using a headset microphone, a lapel microphone
and an eight-channel microphone array. In the single speaker scenario, participants
read from six fixed positions. Fixed positions were assigned for the entire recording
in the overlapping scenario. For the moving scenario, participants moved from one
position to the next while reading. Fifteen speakers were recorded for the single
scenario, nine pairs for the overlapping scenario and nine individuals for the moving
scenario. Each read approximately 90 sentences. The audio data are presented as single
channel 16 kHz FLAC-compressed WAV files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lincoln, Mike
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zwyssig, Erich
ADDED ENTRY--PERSONAL NAME
- Personal name:
McCowan, Iain
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636231
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 810-747-680-852-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MADCAT Phase 1 Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase
1 Training Set contains all training data created by the Linguistic Data Consortium
(LDC) to support Phase 1 of the DARPA MADCAT Program. The material in this release
consists of handwritten Arabic documents, scanned at high resolution and annotated
for the physical coordinates of each line and token. Digital transcripts and English
translations of each document are also provided, with the various content and annotation
layers integrated in a single MADCAT XML output. The goal of the MADCAT program is
to automatically convert foreign text images into English transcripts. MADCAT Phase
1 data was collected by LDC from Arabic source documents in three genres: newswire,
weblog and newsgroup text. Arabic speaking scribes copied documents by hand, following
specific instructions on writing style (fast, normal, careful), writing implement
(pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were
processed to optimize their appearance for the handwriting task, which resulted in
some original source documents being broken into multiple pages for handwriting. Each
resulting handwritten page was assigned to up to five independent scribes, using different
writing conditions. The handwritten, transcribed documents were checked for quality
and completeness, then each page was scanned at a high resolution (600 dpi, greyscale)
to create a digital version of the handwritten document. The scanned images were then
annotated to indicate the physical coordinates of each line and token. Explicit reading
order was also labeled, along with any errors produced by the scribes when copying
the text. The final step was to produce a unified data format that takes multiple
data streams and generates a single xml output file which contains all required information.
The resulting xml file has these distinct components: a text layer that consists of
the source text, tokenization and sentence segmentation; an image layer that consists
of bounding boxes; a scribe demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer. LDC has also released: * MADCAT Phase
2 Training Set (LDC2013T09) * MADCAT Phase 3 Training Set (LDC2013T15) * MADCAT Chinese
Pilot Training Set (LDC2014T13) *Data* This release includes 9,693 annotation files
in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files
in TIFF format. Files are named as follows: * galeID_page#_scribeID.{tif|madcat.xml}
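The file naming pattern above can be split into its parts with a short Python sketch. It assumes that the page and scribe ID fields contain no underscores (the GALE ID may), so the name is split from the right. The example file name below is invented for illustration and is not taken from the release; MadcatName and parse_madcat_name are hypothetical names.

```python
from typing import NamedTuple

# The two suffixes named in the pattern galeID_page#_scribeID.{tif|madcat.xml}
SUFFIXES = (".madcat.xml", ".tif")


class MadcatName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    suffix: str


def parse_madcat_name(filename: str) -> MadcatName:
    """Split galeID_page_scribeID.<suffix> into its components."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # Split from the right so underscores inside the GALE ID survive.
            gale_id, page, scribe_id = stem.rsplit("_", 2)
            return MadcatName(gale_id, page, scribe_id, suffix)
    raise ValueError(f"unrecognized suffix in {filename!r}")


# Invented example name, for illustration only.
parts = parse_madcat_name("DOC_ID_0001_2_SCRIBE07.madcat.xml")
```

Because the annotation file and its scanned image share the same stem, the parsed fields can be used to pair each .madcat.xml file with its .tif counterpart.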
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Written Arabic
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Translating into English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doermann, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u fre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636622
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 573-342-913-646-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
man
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
emk
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Maninkakan Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Maninkakan Lexicon was developed by LDC and contains 5,834 entries of the Maninkakan
language presented as a Maninkakan-English lexicon and a Maninkakan-French lexicon.
It is the second publication in an ongoing LDC project to build an electronic dictionary
of three Mandekan languages: Mawukakan, Maninkakan and Bambara. These are Eastern
Manding languages in the Mande Group of the Niger-Congo language family. LDC released
a Mawukakan Lexicon (LDC2005L01) in 2005 and a Bamanankan Lexicon (LDC2016L01) in
2016. There are approximately 3.5 million Maninkakan speakers in West Africa, mostly
in Guinea and Mali, and also in Liberia, Senegal, Sierra Leone and Ivory Coast. The
word Maninkakan is composed of three lexemes: (1) Mande or Manden, the name of the
territory occupied by the people who speak the language, (2) the suffix -ka which
when added derives the name of the inhabitant of Mande or Manden, and (3) kan, which
means language. Thus Maninkakan is the language of the people who live in Mande/Manden.
Mandekan, Mandenkan, Maninka and Malinke are all used to refer to the language of
the inhabitants of the Mande/Manden. Meghan Glenn served as an editor for the French
and English parts of this Lexicon. More information about the work of LDC in the languages
of West Africa and the challenges those languages present for language resource development
can be found here. *Data* Maninkakan is written using Latin script, Arabic script
and the NKo alphabet. This lexicon is presented using a Latin-based transcription
system because the Latin alphabet is familiar to the majority of Mandekan language
speakers and because it is expected to facilitate the work of researchers interested
in this resource. The dictionary is provided in two formats, Toolbox and XML. Toolbox
is a version of the widely used SIL Shoebox program adapted to display Unicode. Toolbox
can be downloaded for free from this link, http://www-01.sil.org/computIng/catalog/show_software.asp?id=79.
The Toolbox files are provided in two fonts, Arial and Doulos SIL. The Arial files
should display using the Arial font which is standard on most operating systems. Doulos
SIL, available as a free download, is a robust font that should display all characters
without issue. Users should launch Toolbox using the *.prj files in the Arial or Doulous_SIL
folders. The lexicon is presented in Unicode Normalization Form D, canonical decomposition.
This means that each precomposed character is split into a base character followed by its combining marks. See the following
link for more information on Unicode normalization forms. The XML formatted lexicon
was generated by Toolbox and a DTD is included.
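Because the lexicon is stored in Unicode Normalization Form D, search strings typed in precomposed form may not match entries byte for byte. A minimal Python sketch using the standard unicodedata module shows the decomposition; the sample character is an arbitrary accented letter chosen for illustration, not an entry taken from the lexicon, and to_nfd is a hypothetical helper name.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


composed = "\u00f2"            # "ò" as a single precomposed code point
decomposed = to_nfd(composed)  # "o" followed by U+0300 COMBINING GRAVE ACCENT

# The two strings look the same on screen but compare unequal until
# both sides are normalized to the same form.
same_before = composed == decomposed
same_after = to_nfd(composed) == to_nfd(decomposed)
```

Normalizing a query with to_nfd before comparing it against lexicon entries is one way to avoid missed matches.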
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in French, English, and Eastern Maninkakan. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bamba, Moussa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636258
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 912-636-000-365-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Newswire Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Newswire Parallel Text was developed by the Linguistic Data Consortium
(LDC). Along with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
This corpus contains Modern Standard Arabic source text and corresponding English
translations selected from newswire data collected in 2007 by LDC and transcribed
by LDC or under its direction. LDC has released the following GALE Phase 1 & 2 Arabic
Parallel Text data sets: * GALE Phase 1 Arabic Broadcast News Parallel Text - Part
1 (LDC2007T24) * GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02) * GALE Phase 1 Arabic Newsgroup
Parallel Text - Part 1 (LDC2009T03) * GALE Phase 1 Arabic Newsgroup Parallel Text
- Part 2 (LDC2009T09) * GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part
1 (LDC2012T06) * GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17) * GALE Phase 2 Arabic Broadcast
News Parallel Text (LDC2012T18) * GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Data* GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation
pairs, comprising 181,704 tokens of Arabic source text and its English translation.
Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds
Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah. Data was manually selected for translation
according to several criteria, including linguistic features and topic features. The
files were formatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636266
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 198-319-621-200-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast News Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast News Parallel Text was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and corresponding
English translations selected from broadcast news (BN) data collected by LDC between
2005 and 2007 and transcribed by LDC or under its direction. LDC has released the
following GALE Phase 1 & 2 Arabic Parallel Text data sets: * GALE Phase 1 Arabic Broadcast
News Parallel Text - Part 1 (LDC2007T24) * GALE Phase 1 Arabic Broadcast News Parallel
Text - Part 2 (LDC2008T09) * GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02) *
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03) * GALE Phase 1 Arabic
Newsgroup Parallel Text - Part 2 (LDC2009T09) * GALE Phase 2 Arabic Broadcast Conversation
Parallel Text Part 1 (LDC2012T06) * GALE Phase 2 Arabic Broadcast Conversation Parallel
Text Part 2 (LDC2012T14) * GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18) * GALE Phase 2 Arabic
Web Parallel Text (LDC2013T01) *Data* GALE Phase 2 Arabic Broadcast News Parallel
Text includes seven source-translation pairs, comprising 29,210 words of Arabic source
text and its English translation. Data is drawn from six distinct Arabic programs
broadcast between 2005 and 2007 from Abu Dhabi TV, based in Abu Dhabi, United Arab
Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcast programmer
based in Doha, Qatar; Dubai TV, based in Dubai, United Arab Emirates; and Kuwait TV,
a national television station based in Kuwait. The BN programming in this release
focuses on current events topics. The files in this release were transcribed by LDC
staff and/or transcription vendors under contract to LDC in accordance with the Quick
Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries
in addition to transcribing the text. Data was manually selected for translation according
to several criteria, including linguistic features, transcription features and topic
features. The transcribed and segmented files were then reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed LDC's
Arabic to English translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.text. All data are encoded in UTF-8.
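As a rough illustration of the tab-delimited layout described above, the following Python sketch reads a TDF stream into one dictionary per segment. It is not taken from the release: it assumes that the first non-comment row names the fields (optionally as "name;type"), that comment lines begin with ";;", and the column names in the sample are hypothetical. TDF_format.text in the release remains the authoritative description.

```python
import io


def read_tdf(stream, comment_prefix=";;"):
    """Yield one dict per segment from a tab-delimited, UTF-8 decoded stream.

    Assumptions (not confirmed by the catalog record): the first
    non-comment row is a header, and header cells may look like "name;type".
    """
    header = None
    for raw in stream:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and assumed comment lines
        cells = line.split("\t")
        if header is None:
            # Keep only the field name, dropping any ";type" suffix.
            header = [cell.split(";", 1)[0] for cell in cells]
            continue
        yield dict(zip(header, cells))


# Hypothetical three-column sample; real files carry more fields.
SAMPLE = (
    "file;unicode\tstart;float\ttranscript;unicode\n"
    ";;MM an assumed comment line\n"
    "example_file\t0.5\tmarhaban\n"
)

rows = list(read_tdf(io.StringIO(SAMPLE)))
```

For a file on disk, the same function could be given a handle opened with encoding="utf-8", since the record states that all data are UTF-8 encoded.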
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636371
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 418-297-555-875-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC; MediaNet,
Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
2 Arabic Broadcast Conversation Speech Part 1 (LDC2013S02). The source broadcast conversation
recordings feature interviews, call-in programs and round table discussions focusing
principally on current events from the following sources: Al Alam News Channel, based
in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional
broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in
Jordan; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV,
a broadcast programmer based in Egypt; Oman TV, a national broadcaster located in
the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia;
and Syria TV, the national television station in Syria. *Data* The transcript files
are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 752,747 tokens. The transcripts were created with the LDC-developed transcription
tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings. XTrans is available
from the following link, http://www.ldc.upenn.edu/tools/XTrans/downloads/. The files
in this corpus were transcribed by LDC staff and/or by transcription vendors under
contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and
quick rich transcription specification (QRTR), both of which are included in the documentation
with this release. QTR transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional mark-up. It does not
include sentence unit annotation. QRTR annotation adds structural information such
as topic boundaries and manual sentence unit annotation to the core components of
a quick transcript. Files with QTR as part of the filename were developed using QTR
transcription. Files with QRTR in the filename indicate QRTR transcription.
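The filename convention above could be used to sort transcripts by annotation style. The Python sketch below is only an illustration: it assumes the marker may appear in either letter case anywhere in the base filename, and the example filenames in the comments are hypothetical rather than taken from the release.

```python
import re
from pathlib import PurePath

# QRTR is listed first so it is tried before QTR at each position.
_STYLE = re.compile(r"QRTR|QTR", re.IGNORECASE)


def transcription_style(filename):
    """Return "QTR", "QRTR", or None based on the base filename."""
    match = _STYLE.search(PurePath(filename).name)
    return match.group(0).upper() if match else None


# Hypothetical names, for illustration only:
#   transcription_style("data/SHOW_20070101.qrtr.tdf")
#   transcription_style("data/SHOW_20070102.qtr.tdf")
```

Files for which the function returns "QRTR" would be the ones expected to carry topic boundaries and manual sentence unit annotation.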
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636347
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 548-942-841-002-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Web Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Web Parallel Text was developed by the Linguistic Data Consortium
(LDC). Along with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
This corpus contains Modern Standard Arabic source text and corresponding English
translations selected from web data collected by LDC and translated by LDC or under
its direction. LDC has released the following GALE Phase 1 & 2 Arabic Parallel Text
data sets: * GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09) * GALE Phase
1 Arabic Blog Parallel Text (LDC2008T02) * GALE Phase 1 Arabic Newsgroup Parallel
Text - Part 1 (LDC2009T03) * GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
(LDC2009T09) * GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14) * GALE
Phase 2 Arabic Newswire Parallel Text (LDC2012T17) * GALE Phase 2 Arabic Broadcast
News Parallel Text (LDC2012T18) * GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Data* GALE Phase 2 Arabic Web Parallel Text includes 60 source-translation document
pairs, comprising 42,089 words of Arabic source text and its English translation.
Data was drawn from various Arabic weblog and newsgroup sources. Data was manually
selected for translation according to several criteria, including linguistic features,
transcription features and topic features. The files were reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed LDC's
Arabic to English translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.text. All data are encoded in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Electronic discussion groups
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Blogs
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636282
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 685-457-689-684-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire was developed
by the Linguistic Data Consortium (LDC) and contains 169,080 tokens of word aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. GALE Chinese-English Word Alignment and Tagging
Training Part 1 -- Newswire and Web (LDC2012T16) is also available through LDC. *Data*
This release consists of Chinese source newswire collected by LDC in 2008. The distribution
by genre, words, character tokens and segments appears below:
Language  Genre  Files  Words   CharTokens  Segments
Chinese   nw     1982   112720  169080      4239
Note that all token counts
are based on the Chinese data only. One token is equivalent to one character and one
word is equivalent to 1.5 characters. The Chinese word alignment tasks consisted of
the following components: * Identifying, aligning, and tagging 8 different types of
links * Identifying, attaching, and tagging local-level unmatched words * Identifying
and tagging sentence/discourse-level unmatched words * Identifying and tagging all
instances of Chinese 的 (DE) except when they were a part of a semantic link. The file
names indicate the source provider, the story date and the language. For example,
AFP_CMN_20080406 refers to the source Agence France Presse (AFP), the story date is
April 6, 2008 and the language is Chinese (CMN).
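The naming convention in the example above can be unpacked mechanically. The following Python sketch is an assumption-laden illustration: it supposes that names begin with PROVIDER_LANGUAGE_YYYYMMDD and ignores anything that may follow the date, since the record does not describe any further suffix. The names StoryName and parse_story_name are invented for this example.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class StoryName(NamedTuple):
    provider: str     # source provider code, e.g. AFP
    language: str     # language code, e.g. CMN
    story_date: date  # story date parsed from YYYYMMDD


_NAME = re.compile(r"^([A-Za-z]+)_([A-Za-z]{3})_(\d{8})")


def parse_story_name(name):
    """Split a file name such as AFP_CMN_20080406 into its three parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    provider, language, stamp = match.groups()
    return StoryName(provider, language, datetime.strptime(stamp, "%Y%m%d").date())


example = parse_story_name("AFP_CMN_20080406")
```

Here example.provider is "AFP", example.language is "CMN" and example.story_date is April 6, 2008, matching the description in the record.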
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636398
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 219-635-569-532-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed
by LDC and contains 158,387 tokens of word aligned Chinese and English parallel text
enriched with linguistic tags. This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical
machine translation include the incorporation of linguistic knowledge in word aligned
text as a means to improve automatic word alignment and machine translation quality.
This is accomplished with two annotation schemes: alignment and tagging. Alignment
identifies minimum translation units and translation relations by using minimum-match
and attachment annotation approaches. A set of word tags and alignment link tags are
designed in the tagging scheme to describe these translation units and relations.
Tagging adds contextual, syntactic and language-specific features to the alignment
annotation. Other releases available in this series are: * GALE Chinese-English Word
Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16) * GALE Chinese-English
Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20) * GALE Chinese-English
Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24) *Data* This release
consists of Chinese source web data (newsgroup, weblog) collected by LDC. The distribution
by words, character tokens and segments appears below:
Language  Files  Words    CharTokens  Segments
Chinese   1,224  105,591  158,387     4,836
Note that all token counts are based
on the Chinese data only. One token is equivalent to one character and one word is
equivalent to 1.5 characters. The Chinese word alignment tasks consisted of the following
components: * Identifying, aligning, and tagging 8 different types of links * Identifying,
attaching, and tagging local-level unmatched words * Identifying and tagging sentence/discourse-level
unmatched words * Identifying and tagging all instances of Chinese 的 (DE) except when
they were a part of a semantic link.
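The stated equivalence of one word to 1.5 characters appears to account for the word count given above: 158,387 character tokens divided by 1.5 is about 105,591. A minimal Python sketch of that arithmetic follows; the use of rounding to the nearest integer is an assumption, as the record does not say how fractional words were handled.

```python
CHARS_PER_WORD = 1.5  # ratio stated in the summary


def words_from_char_tokens(char_tokens):
    """Estimate the word count from a Chinese character-token count."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(char_tokens / CHARS_PER_WORD)


estimated_words = words_from_char_tokens(158387)
```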
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636355
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 896-999-017-833-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2012 Open Machine Translation (OpenMT) Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2012 Open Machine Translation (OpenMT) Evaluation was developed by NIST Multimodal
Information Group. This release contains source data, reference translations and scoring
software used in the NIST 2012 OpenMT evaluation, specifically, for the Chinese-to-English
language pair track. The package was compiled and scoring software was developed at
NIST, making use of Chinese newswire and web data and reference translations collected
and developed by LDC. The objective of the OpenMT evaluation series is to support
research in, and help advance the state of the art of, machine translation (MT) technologies
-- technologies that translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and fluent translation
of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES
(Translingual Information Detection, Extraction) program. Beginning with the 2006
evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT.
These evaluations provide an important contribution to the direction of research efforts
and the calibration of technical capabilities in MT. The Open MT evaluations are intended
to be of interest to all researchers working on the general problem of automatic translation
between human languages. To this end, they are designed to be simple, to focus on
core technology issues and to be fully supported. The 2012 task was to evaluate five
language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English
and Korean-to-English. This release consists of the material used in the Chinese-to-English
language pair track. For more general information about the NIST OpenMT evaluations,
please refer to the NIST OpenMT website. This evaluation kit includes a single Perl
script (mteval-v13a.pl) that may be used to produce a translation quality score for
one (or more) MT systems. The script works by comparing the system output translation
with a set of (expert) reference translations of the same source text. Comparison
is based on finding sequences of words in the reference translations that match word
sequences in the system output translation. *Data* This release contains 222 documents
with corresponding source and reference files, the latter of which contains four independent
human reference translations of the source data. The source data is comprised of Chinese
newswire and web data collected by LDC in 2011. A portion of the web data concerned
the topic of food and was treated as a restricted domain. The table below displays
statistics by source, genre, documents, segments and source tokens.
Source                     Genre     Documents  Segments  Source Tokens
Chinese General            Newswire  45         400       18184
Chinese General            Web Data  28         420       15181
Chinese Restricted Domain  Web Data  149        2184      48422
The token counts for
Chinese data are character counts, which were obtained by counting tokens matching
the UNICODE-based regular expression \w. The Python re module was used to obtain those
counts. The data in this package are in XML format compliant with the included DTD
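The character-counting approach described above can be sketched in Python as follows. This is an illustration, not the script used by NIST: it assumes the expression in question is \w applied one character at a time, and it relies on Python 3, where \w is Unicode-aware for str patterns by default. The function name is invented for this example.

```python
import re

# In Python 3, \w on str input matches Unicode word characters,
# including Chinese characters, Latin letters, digits and underscore.
_WORD_CHAR = re.compile(r"\w")


def count_char_tokens(text):
    """Count characters in text that match \\w, one token per character."""
    return len(_WORD_CHAR.findall(text))


token_count = count_char_tokens("我爱北京。")
```

Punctuation such as the full stop "。" is not matched by \w, so only the four Chinese characters in the sample string are counted.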
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636363
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 463-416-327-487-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic
Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast
conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding transcripts are released
as GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (LDC2013T04). Broadcast
audio for the DARPA GALE program was collected at the LDC Philadelphia, PA USA facilities
and at three remote collection sites: Hong Kong University of Science and Technology,
Hong Kong, Republic of China (Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC,
Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported
GALE at a rate of approximately 300 hours per week of programming from more than 50
broadcast sources for a total of over 30,000 hours of collected broadcast audio over
the life of the program. The LDC local broadcast collection system is highly automated,
easily extensible and robust, and is capable of collecting, processing and evaluating
hundreds of hours of content from several dozen sources per day. The broadcast material
is served to the system by a set of free-to-air (FTA) satellite receivers, commercial
direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers,
and cable television (CATV) feeds. The mapping between receivers and recorders is
dynamic and modular. All signal routing is performed under computer control, using
a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and
are then processed to extract audio, to generate keyframes and compressed audio/video,
to produce time-synchronized closed captions (in the case of North American English)
and to generate automatic speech recognition (ASR) output. An overview of the system,
the sources recorded and the configuration of the recording laboratory are contained
in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVo-style
digital video recording (DVR) system that records two streams of A/V material simultaneously.
It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can
operate outside of the United States. It has a small footprint, weighs less than 30
pounds and can be transported as carry-on luggage. Medianet collected Arabic programming
from across the Gulf region using its internal system and LDC's portable broadcast
collection platform installed in 2008. The portable platform deployed at the Medianet
Tunisian collection facility collected multiple streams of regional Arabic programming
from various sources. MTC collected Arabic programming using its internal collection
system. *Data* The broadcast conversation recordings in this release feature interviews,
call-in programs and round table discussions focusing principally on current events
from the following sources: Al Alam News Channel, based in Iran; Al Arabiya, a news
television station based in Dubai; Aljazeera, a regional broadcaster located in Doha,
Qatar; Al Ordiniyah, a national broadcast station in Jordan; Lebanese Broadcasting
Corporation, a Lebanese television station; Nile TV, a broadcast programmer based
in Egypt; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi
TV, a national television station based in Saudi Arabia; and Syria TV, the national
television station in Syria. A table showing the number of programs and hours recorded
from each source is contained in the readme file. This release contains 143 audio
files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit
PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification
Version 2.0 which is included in this release. The broadcast auditing process served
three principal goals: as a check on the operation of the broadcast collection system
equipment by identifying failed, incomplete or faulty recordings, as an indicator
of broadcast schedule changes by identifying instances when the incorrect program
was recorded, and as a guide for data selection by retaining information about program
genre, data type and topic.
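The stated audio format (16000 Hz, single-channel, 16-bit PCM) can be checked against a file header. The following is a minimal sketch using Python's standard wave module; the function name check_wav_format and its parameters are illustrative assumptions and are not part of the release.

```python
import wave

def check_wav_format(path, rate=16000, channels=1, sample_width_bytes=2):
    """Return True if the WAV header matches the expected format.

    Defaults mirror the format stated in the summary:
    16000 Hz, one channel, 16-bit (2-byte) samples.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width_bytes
        )
```

A file that fails this check may have been resampled or converted after download.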
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Spoken Arabic
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636290
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 335-916-789-872-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Annotated English Gigaword
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Annotated English Gigaword was developed by Johns Hopkins University's Human Language
Technology Center of Excellence. It adds automatically generated syntactic and discourse
structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains
an API and tools for reading the dataset's XML files. The goal of the annotation is
to provide a standardized corpus for knowledge extraction and distributional semantics
which enables broader involvement in large-scale knowledge-acquisition efforts by
researchers. *Data* Annotated English Gigaword contains the nearly ten million documents
(over four billion words) of the original English Gigaword Fifth Edition from seven
news sources:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* Washington Post/Bloomberg Newswire Service (wpb_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)
The following layers of annotation were added:
* Tokenized and segmented sentences
* Treebank-style constituent parse trees
* Syntactic dependency trees
* Named entities
* In-document coreference chains
The annotation was performed in a three-step process: (1) the data was preprocessed
and sentences selected for annotation (sentences with more than 100 tokens were excluded),
(2) syntactic parses were derived, and (3) the parsed output was post-processed to
derive syntactic dependencies, named entities and coreference chains. Over 183 million
sentences were parsed. The data is stored in a form similar to the Gigaword SGML format
with XML annotations containing the additional markup. The included API provides object
representations for the contents of the XML files.
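The length filter in step (1) can be illustrated with a short sketch. It assumes whitespace tokenization, which is a simplification and not necessarily the tokenizer used for the corpus; the function name select_for_annotation is hypothetical and is not part of the included API.

```python
MAX_TOKENS = 100  # threshold stated in the summary

def select_for_annotation(sentences, max_tokens=MAX_TOKENS):
    """Keep sentences with at most max_tokens tokens.

    Tokens are approximated by splitting on whitespace (an assumption).
    """
    return [s for s in sentences if len(s.split()) <= max_tokens]
```

A sentence of exactly 100 tokens is kept under this reading, since only sentences with more than 100 tokens were excluded.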
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Napoles, Courtney
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gormley, Matthew R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Van Durme, Benjamin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636320
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 369-177-595-916-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web was developed
by the Linguistic Data Consortium (LDC) and contains 154,541 tokens of word aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. GALE Chinese-English Word Alignment and Tagging
Training Part 1 -- Newswire and Web (LDC2012T16) and GALE Chinese-English Word Alignment
and Tagging Training Part 2 -- Newswire (LDC2012T20) are also available through LDC.
*Data* This release consists of Chinese source web data (newsgroup, weblog) collected
by LDC in 2008 and 2009. The distribution by words, character tokens and segments
appears below:
Language  Files  Words   CharTokens  Segments
Chinese   1249   103027  154541      4842
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters. The Chinese word alignment
tasks consisted of the following components:
* Identifying, aligning, and tagging 8 different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part
of a semantic link.
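The stated equivalence of one word to 1.5 characters can be checked against the counts given in the summary. This is a simple arithmetic sketch; the variable names are illustrative.

```python
# Counts taken from the distribution in the summary.
words = 103027
char_tokens = 154541

# Ratio of character tokens to words.
ratio = char_tokens / words
print(round(ratio, 2))  # prints 1.5
```

The word count therefore appears to be derived from the character count by the 1.5 ratio rather than counted independently, though the summary does not say so explicitly.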
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636304
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 649-519-153-695-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese-English Semiconductor Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation.
It consists of parallel sentences from a collection of abstracts from scientific articles
on semiconductors published in Mandarin and translated into English by translators
with particular expertise in the technical area. Translators were instructed to err
on the side of literal translation if required, but to maintain the technical writing
style of the source and to make the resulting English as natural as possible. The
translators followed specific guidelines for translation, and those are included in
this distribution. *Data* There are 2,169 lines of parallel Mandarin and English,
with a total of 125,302 characters of Mandarin and 64,851 words of English, presented
in a separate UTF-8 plain text file for each language. The sentences were translated
in sequential order and presented in a scrambled order, such that parallel sentences
at identical line numbers are translations. For example, the 31st line of the English
file is a translation of the 31st line of the Mandarin file. The original line sequence
is not provided.
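Because translations share line numbers across the two files, sentence pairs can be recovered by reading both files in step. The sketch below assumes two UTF-8 text files with one sentence per line; the function name read_parallel and the file paths passed to it are hypothetical, since the actual file names in the distribution are not given here.

```python
def read_parallel(source_path, target_path):
    """Return a list of (source, target) sentence pairs aligned by line number."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]
    # Alignment is positional, so a length mismatch means the pairing is unreliable.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))
```

With zero-based indexing, the pair at index 30 of the returned list corresponds to the 31st line of each file.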
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Semiconductors
- Form subdivision:
Periodicals
- Form subdivision:
Abstracts
- Form subdivision:
Translations into English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Burger, John D.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Henderson, John C.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zarrella, Guido
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2012 pau u rus d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636312
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2012T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 336-503-445-973-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Russian-English Computer Security Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2012]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2012T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Russian-English Computer Security Parallel Text was developed by The MITRE Corporation.
It consists of parallel sentences from a set of computer security reports published
in Russian and translated into English by translators with particular expertise in
the technical area. Translators were instructed to err on the side of literal translation
if required, but to maintain the technical writing style of the source and to make
the resulting English as natural as possible. The translators followed specific guidelines
for translation, and those are included in this distribution. *Data* There are 6,276
lines of parallel Russian and English, with a total of 60,059 words of Russian and
76,437 words of English, presented in a separate UTF-8 plain text file for each language.
The sentences were translated in sequential order and presented in a scrambled order,
such that parallel sentences at identical line numbers are translations. For example,
the 31st line of the English file is a translation of the 31st line of the Russian
file. The original line sequence is not provided. 1,694 untranslated lines (such as
code snippets) are included as a separate file.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Russian and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Burger, John D.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Henderson, John C.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zarrella, Guido
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2012T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636274
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 278-123-012-906-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese-English Biology and Chemistry Abstract Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese-English Biology and Chemistry Abstract Parallel Text was developed by The
MITRE Corporation. It consists of parallel sentences from a collection of chemistry
and biology-related scientific article abstracts published in Mandarin and translated
into English by translators with particular expertise in the technical area. Translators
were instructed to err on the side of literal translation if required, but to maintain
the technical writing style of the source and make the resulting English as natural
as possible. The translators were given specific guidelines for translation, and those
are included in this distribution. *Data* This release contains 2,239 lines of parallel
Mandarin and English, with a total of 156,445 characters of Mandarin and 75,515 words
of English, presented in a separate UTF-8 plain text file for each language. The sentences
were translated in sequential order and presented in scrambled order, such that parallel
sentences at identical line numbers are translations. For example, the 31st line of
the English file is a translation of the 31st line of the Mandarin file. The original
line sequence is not provided.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Biology
- Form subdivision:
Abstracts
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chemistry
- Form subdivision:
Abstracts
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese periodicals
- Form subdivision:
Abstracts
- Form subdivision:
Translations into English
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Translating into English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doran, Christine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Burger, John D.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Henderson, John C.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zarrella, Guido
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u spa d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 375-727-871-052-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
1993-2007 United Nations Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
1993-2007 United Nations Parallel Text was developed by Google Research. It consists
of United Nations (UN) parliamentary documents from 1993 through 2007 in the official
languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. There
are 673,670 raw text documents and 520,283 word alignment documents. UN parliamentary
documents are available from the UN Official Document System (UN ODS) at http://ods.un.org/.
UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary
documents. It has complete coverage dating from 1993 and variable coverage before that.
Documents exist in one or more of the official languages of the UN: Arabic, Chinese,
English, French, Russian, and Spanish. UN ODS also contains a large number of German
documents, marked with the language "other", but these are not included in this dataset.
For more information, see the UN ODS documentation at http://documents.un.org/help_E.htm.
For more details of the UN bibliographic systems, see http://www.un.org/depts/dhl/unbisref_manual/.
LDC has released parallel UN parliamentary documents in English, French and Spanish
spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A). *Data* The
data is presented as raw text and word-aligned text. The raw text is very close to
what was extracted from the original word processing documents in UN ODS (e.g., Word,
WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized,
tokenized, aligned at the sentence level, further broken into sub-sentential chunk-pairs,
and then aligned at the word level. The sentence, chunk, and word alignment operations were
performed separately for each individual language pair. The files are presented in
tar files and compressed using the bzip2 compression utility. The bzip2 utility is
standard in most Linux releases. For Windows users, there are a variety of decompression
software options. 7-Zip will decompress tar and bzip2 formats. Note that in the data/aligned
folder, the en-zh-1993.tar.bz2 and en-zh-1994.tar.bz2 archives decompress into empty
folders. This is intentional as there is no Chinese aligned data for those two years.
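The bzip2-compressed tar archives can also be inspected without a separate decompression tool. The following sketch uses Python's standard tarfile module to list the regular files in an archive; the function name regular_files is illustrative. An archive that holds only an empty folder, as described for the two en-zh archives, yields an empty list.

```python
import tarfile

def regular_files(archive_path):
    """Return the names of regular files inside a .tar.bz2 archive.

    Directory entries are skipped, so an archive containing only
    an empty folder returns an empty list.
    """
    with tarfile.open(archive_path, "r:bz2") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]
```

Checking the list before extraction distinguishes an intentionally empty archive from a failed download.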
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish, Russian, French, English, Mandarin Chinese, Arabic, and Chinese.
Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franz, Alex
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kumar, Shankar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brants, Thorsten
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636525
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 067-355-674-551-6
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Mixer 6 Speech was developed by the Linguistic Data Consortium (LDC) and comprises
15,863 hours of audio recordings of interviews, transcript readings and conversational
telephone speech involving 594 distinct native English speakers. This material was
collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase
6, the focus of which was on native American English speakers local to the Philadelphia
area. The speech data in this release was collected by LDC at its Human Subjects Collection
facilities in Philadelphia. The telephone collection protocol was similar to other
LDC telephone studies (e.g., Switchboard-2 Phase III Audio - LDC2002S06): recruited
speakers were connected through a robot operator to carry on casual conversations
lasting up to 10 minutes, usually about a daily topic announced by the robot operator
at the start of the call. The raw digital audio content for each call side was captured
as a separate channel, and each full conversation was presented as a 2-channel interleaved
audio file, with 8000 samples/second and u-law sample encoding. Each speaker was asked
to complete 15 calls. The multi-microphone portion of the collection utilized 14 distinct
microphones installed identically in two multi-channel audio recording rooms at LDC.
Each session was guided by collection staff using prompting and recording software
to conduct the following activities: (1) repeat questions (less than one minute),
(2) informal conversation (typically 15 minutes), (3) transcript reading (approximately
15 minutes) and (4) telephone call (generally 10 minutes). Speakers recorded up to
three 45-minute sessions on distinct days. The 14 channels were recorded synchronously
into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.
Certain demographic information about the speakers was collected, including date of
birth, level of education, native language, other language capability, place of birth,
place of residence and occupation. The recordings in this corpus were used in NIST
Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested
in applying those benchmark test sets should consult the respective NIST Evaluation
Plans for guidelines on allowable training data for those tests. *Data* The collection
contains 4,410 recordings made via the public telephone network and 1,425 sessions
of multiple microphone recordings in office-room settings. The telephone recordings
are presented as 8-kHz 2-channel NIST SPHERE files, and the microphone recordings
are 16-kHz 1-channel flac/ms-wav files. All audio file names indicate the date and
time when the recording began, along with other identifying information, as follows:
Telephone: {yyyymmdd}_{hrmnsc}_{callid}.sph
Microphone: {yyyymmdd}_{hrmnsc}_{room}_{subjid}_CH{nn}.flac
* yyyymmdd is the year, month and day of recording
* hrmnsc is the hour, minute and second when recording began
* callid is a unique, incremental number assigned to each call
* room is either LDC or HRM, indicating which office was used
* subjid is a numeric identifier assigned to the speaker
When the flac files are uncompressed,
they become ms-wav/RIFF files (flac compression does not presently support SPHERE
file format). The telephone audio is presented in SPHERE format because (a) this is
consistent with other telephone audio releases from LDC, and (b) flac does not support
ulaw sample encoding. The current release of the open-source SoX utility is able to
handle both formats as input. Other utilities are available for both flac and SPHERE
formats.
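The file naming scheme described in this summary can be parsed with two regular expressions. The sketch below is an illustration only: it assumes that callid, subjid and the channel number are plain digit strings, and the function and pattern names are not taken from the release.

```python
import re
from datetime import datetime

# Telephone: {yyyymmdd}_{hrmnsc}_{callid}.sph
TELEPHONE = re.compile(r"^(\d{8})_(\d{6})_(\d+)\.sph$")
# Microphone: {yyyymmdd}_{hrmnsc}_{room}_{subjid}_CH{nn}.flac
MICROPHONE = re.compile(r"^(\d{8})_(\d{6})_(LDC|HRM)_(\d+)_CH(\d+)\.flac$")


def parse_name(filename):
    """Return a dict of fields from a recording file name, or None if unmatched."""
    match = TELEPHONE.match(filename)
    if match:
        date, time, call_id = match.groups()
        return {
            "kind": "telephone",
            "start": datetime.strptime(date + time, "%Y%m%d%H%M%S"),
            "callid": call_id,
        }
    match = MICROPHONE.match(filename)
    if match:
        date, time, room, subj_id, channel = match.groups()
        return {
            "kind": "microphone",
            "start": datetime.strptime(date + time, "%Y%m%d%H%M%S"),
            "room": room,
            "subjid": subj_id,
            "channel": int(channel),
        }
    return None
```

For example, parse_name("20091015_143000_HRM_120345_CH02.flac") yields a microphone record for room HRM, channel 2. That file name is made up for illustration and does not refer to an actual file in the corpus.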
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Telephone calls
- Form subdivision:
Databases.
- Geographic subdivision:
United States
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Speech perception
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandschain, Linda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 839-419-404-977-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast Conversation Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast Conversation Transcripts was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 120 hours of Chinese
broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University
of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding audio data is released as
GALE Phase 2 Chinese Broadcast Conversation Speech (LDC2013S04). The source broadcast
conversation recordings feature interviews, call-in programs and round table discussions
focusing principally on current events from the following sources: Anhui TV, a regional
television station in Mainland China, Anhui Province, China Central TV (CCTV), a national
and international broadcaster in Mainland China, Hubei TV, a regional broadcaster
in Mainland China, Hubei Province, and Phoenix TV, a Hong Kong-based satellite television
station. *Data* The transcript files are in plain-text, tab-delimited format (TDF)
with UTF-8 encoding, and the transcribed data totals 1,523,373 tokens. The transcripts
were created with the LDC-developed transcription tool, XTrans, a multi-platform,
multilingual, multi-channel transcription tool that supports manual transcription
and annotation of audio recordings. XTrans is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed the quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR) developed by LDC, both of which
are included in the documentation with this release. QTR transcription consists of
quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal
additional mark-up. It does not include sentence unit annotation. QRTR annotation
adds structural information such as topic boundaries and manual sentence unit annotation
to the core components of a quick transcript. Files with QTR as part of the filename
were developed using QTR transcription. Files with QRTR in the filename indicate QRTR
transcription.
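A short sketch of how the QTR/QRTR naming convention and the tab-delimited layout might be handled in Python follows. It assumes the marker may appear in either upper or lower case in the file name, and it makes no assumption about the column layout of the TDF files, which is defined in the release documentation. Both function names are illustrative.

```python
import csv
from pathlib import Path


def transcription_style(filename):
    """Return "QRTR", "QTR", or None based on the marker in a file name."""
    name = Path(filename).name.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


def read_tdf_rows(path):
    """Yield each line of a UTF-8, tab-delimited file as a list of fields."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript text untouched.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
```

Header or comment lines, if a file has any, are returned like any other row, so callers should filter them according to the TDF description shipped with the corpus.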
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636428
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 995-434-316-201-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast Conversation Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast Conversation Speech was developed by the Linguistic
Data Consortium (LDC) and is comprised of approximately 120 hours of Chinese broadcast
conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science
and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding transcripts are released as GALE Phase
2 Chinese Broadcast Conversation Transcripts (LDC2013T08). Broadcast audio for the
GALE program was collected at the Philadelphia, PA USA facilities of LDC and at three
remote collection sites: HKUST (Chinese), Medianet, Tunis, Tunisia (Arabic) and MTC,
Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported
GALE at a rate of approximately 300 hours per week of programming from more than 50
broadcast sources for a total of over 30,000 hours of collected broadcast audio over
the life of the program. The LDC local broadcast collection system is highly automated,
easily extensible and robust and capable of collecting, processing and evaluating
hundreds of hours of content from several dozen sources per day. The broadcast material
is served to the system by a set of free-to-air (FTA) satellite receivers, commercial
direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers,
and cable television (CATV) feeds. The mapping between receivers and recorders is
dynamic and modular. All signal routing is performed under computer control, using
a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and
are then processed to extract audio, to generate keyframes and compressed audio/video,
to produce time-synchronized closed captions (in the case of North American English)
and to generate automatic speech recognition (ASR) output. An overview of the system,
the sources recorded and the configuration of the recording laboratory are contained
in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVo-style
digital video recording (DVR) system that records two streams of A/V material simultaneously.
It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can
operate outside of the United States. It has a small footprint, weighs less than 30
pounds and can be transported as carry-on luggage. HKUST collected Chinese broadcast
programming using its internal recording system and a portable broadcast collection
platform designed by LDC and installed at HKUST in 2006. *Data* The broadcast conversation
recordings in this release feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources: Anhui TV, a regional
television station in Mainland China, Anhui Province, China Central TV (CCTV), a national
and international broadcaster in Mainland China, Hubei TV, a regional broadcaster
in Mainland China, Hubei Province, and Phoenix TV, a Hong Kong-based satellite television
station. A table showing the number of programs and hours recorded from each source
is contained in the readme file. This release contains 202 audio files presented in
Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Chinese speaker following Audit Procedure Specification Version
2.0 which is included in this release. The broadcast auditing process served three
principal goals: (1) as a check on the operation of the broadcast collection system
equipment by identifying failed, incomplete or faulty recordings, (2) as an indicator
of broadcast schedule changes by identifying instances when the incorrect program
was recorded and (3) as a guide for data selection by retaining information about
the genre, data type and topic of a program.
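The stated audio format (16000 Hz, single-channel, 16-bit PCM) can be checked with Python's standard wave module. The following is a minimal sketch with illustrative function names. It is not the audit procedure referred to above, which concerns recording content rather than file format.

```python
import wave


def matches_release_format(path):
    """Return True if a WAV file is 16000 Hz, single-channel, 16-bit PCM."""
    with wave.open(path, "rb") as audio:
        return (
            audio.getframerate() == 16000
            and audio.getnchannels() == 1
            and audio.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and audio.getcomptype() == "NONE"  # uncompressed PCM
        )


def duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as audio:
        return audio.getnframes() / audio.getframerate()
```

Summing duration_seconds over all files gives a rough cross-check against the approximate hours stated for the release.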
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636436
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 828-846-182-243-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MADCAT Phase 2 Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase
2 Training Set contains all training data created by the Linguistic Data Consortium
to support Phase 2 of the DARPA MADCAT Program. The data in this release consists
of handwritten Arabic documents, scanned at high resolution and annotated for the
physical coordinates of each line and token. Digital transcripts and English translations
of each document are also provided, with the various content and annotation layers
integrated in a single MADCAT XML output. The goal of the MADCAT program is to automatically
convert foreign text images into English transcripts. MADCAT Phase 2 data was collected
from Arabic source documents in three genres: newswire, weblog and newsgroup text.
Arabic speaking scribes copied documents by hand, following specific instructions
on writing style (fast, normal, careful), writing implement (pen, pencil) and paper
(lined, unlined). Prior to assignment, source documents were processed to optimize
their appearance for the handwriting task, which resulted in some original source
documents being broken into multiple pages for handwriting. Each resulting handwritten
page was assigned to up to five independent scribes, using different writing conditions.
The handwritten, transcribed documents were checked for quality and completeness,
then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital
version of the handwritten document. The scanned images were then annotated to indicate
the physical coordinates of each line and token. Explicit reading order was also labeled,
along with any errors produced by the scribes when copying the text. The final step
was to produce a unified data format that takes multiple data streams and generates
a single MADCAT XML output file with all required information. The resulting madcat.xml
file has these distinct components: (1) a text layer that consists of the source text,
tokenization and sentence segmentation, (2) an image layer that consists of bounding
boxes, (3) a scribe demographic layer that consists of scribe ID and partition (train/test)
and (4) a document metadata layer. LDC has also released:
* MADCAT Phase 1 Training Set (LDC2012T15)
* MADCAT Phase 3 Training Set (LDC2013T15)
* MADCAT Chinese Pilot Training Set (LDC2014T13)
*Data* This release includes 27,814 annotation files in
both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding
scanned image files in TIFF format. The annotation results in GEDI XML output files
include ground truth annotations and source transcripts. Files are named as follows:
* galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}
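The naming pattern above can be split into its parts with a regular expression. This sketch rests on assumptions not stated in the record: that the page number and scribe ID contain no underscores (the galeID itself may), and that the scribe ID contains no dots. The function name and the sample file name in the usage note are hypothetical.

```python
import re

# galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}
# The greedy first group lets the galeID keep any underscores of its own;
# the last two underscore-separated fields are taken as page and scribe ID.
MADCAT_NAME = re.compile(
    r"^(?P<gale_id>.+)_(?P<page>[^_]+)_(?P<scribe_id>[^_.]+)"
    r"\.(?P<kind>tif|gedi\.xml|madcat\.xml)$"
)


def parse_madcat_name(filename):
    """Return the name components as a dict, or None if the name does not fit."""
    match = MADCAT_NAME.match(filename)
    return match.groupdict() if match else None
```

With a made-up name such as "AAW_ARB_20061105.0017_1_LDC0042.madcat.xml", the function returns the galeID "AAW_ARB_20061105.0017", page "1", scribe ID "LDC0042" and kind "madcat.xml". Grouping results by galeID and page then collects the versions of one page written by different scribes.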
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Written Arabic
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Translating into English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doermann, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636444
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 544-051-024-519-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Parallel Aligned Treebank -- Newswire
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Parallel Aligned Treebank -- Newswire was developed by the Linguistic
Data Consortium (LDC) and contains 267,520 tokens of word aligned Arabic and English
parallel text with treebank annotations. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program. Parallel aligned
treebanks are treebanks annotated with morphological and syntactic structures aligned
at the sentence level and the sub-sentence level. Such data sets are useful for natural
language processing and related fields, including automatic word alignment system
training and evaluation, transfer-rule extraction, word sense disambiguation, translation
lexicon extraction and cultural heritage and cross-linguistic studies. With respect
to machine translation system development, parallel aligned treebanks may improve
system performance with enhanced syntactic parsers, better rules and knowledge about
language pairs and reduced word error rate. In this release, the source Arabic data
was translated into English. Arabic and English treebank annotations were performed
independently. The parallel texts were then word aligned. The material in this corpus
corresponds to the Arabic treebanked data appearing in Arabic Treebank: Part 3 v 3.2
(LDC2010T08) (ATB) and to the English treebanked data in English Translation Treebank:
An-Nahar Newswire (LDC2012T02). *Data* The source data consists of Arabic newswire
from the Lebanese publication An Nahar collected by LDC in 2002. All data is encoded
as UTF-8. A count of files, words, tokens and segments is below.
Language  Files  Words    Tokens   Segments
Arabic    364    182,351  267,520  7,711
Note: Word count is based on the untokenized Arabic source and token count is based
on the ATB-tokenized Arabic source.
The purpose of the GALE word alignment task was to find correspondences between words,
phrases or groups of words in a set of parallel texts. Arabic-English word alignment
annotation consisted of the following tasks:
* Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
* Tagging unmatched words attached to other words or phrases
This release contains four types of files - raw, tokenized, treebank,
and wa. The raw format contains the original Arabic and English sentences without
any annotation. The tokenized format is the treebank tokenized version of the raw
data. It may contain Empty Category tokens (treebank leaves that have the POS label
-NONE-). The treebank and wa files are treebank and word alignment annotations on
the tokenized files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zakhary, Dalal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636452
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 854-216-857-102-2
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Greybeard was developed by the Linguistic Data Consortium (LDC) and is comprised of
approximately 590 hours of English telephone conversation speech collected in October
and November 2008 by LDC. The goal was to record new telephone conversations among
subjects who had participated in one or more previous LDC telephone collections, from
Switchboard-1 (1991) through the Mixer studies (2006). A total of 172 subjects were
enrolled in the Greybeard collection, all of whom had participated in one of the following:
* Switchboard-1 (LDC97S62) 1991-1992: 2 subjects * Switchboard-2 (LDC98S75, LDC99S79,
LDC2002S06) 1996-1997: 16 subjects * Mixer 1 and 2 2003-2005: 103 subjects * Mixer
3 2006: 51 subjects Most Greybeard participants completed 12 calls. Some subjects
completed up to 24 calls. Calls were made or received via an automatic operator system
at LDC which connected two participants and announced a topic for discussion. *Data*
This release consists of 4680 calls -- the complete set of calls recorded during
the Greybeard collection (1098 calls) as well as all calls from the legacy collections
that involved the Greybeard speakers. The audio from each call was captured digitally
by the operator system and stored in a separate file as raw mu-law sample data. As
the recordings were uploaded daily from the robot operator to network disk storage,
automated processes reformatted the audio into a 2-channel SPHERE-format file for
each conversation and queued the recordings for manual audit to verify speaker identification
and to check other aspects of the recording. Auditors provided impressionistic judgments
on overall audio quality, presence of background noise and cross-channel echo and
any other technical difficulty with the call, in addition to confirming the speaker-ID
on each channel. These auditor decisions are provided in the call_info tables, described
in more detail in the included documentation. For this release, each 2-channel recording
was converted from SPHERE to MS-WAV file format and compressed using FLAC. All audio
files are 2-channel, 8 kHz, 16-bit PCM sample data, in FLAC-compressed form (http://flac.sourceforge.net).
When uncompressed, they have MS-WAV/RIFF headers.
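As an illustration of the audio parameters described above, the following minimal Python sketch checks whether a WAV file (for example, one obtained by decompressing a FLAC file from this release with a separate tool) has 2 channels, an 8 kHz sample rate and 16-bit samples. The function names are hypothetical and are not part of the corpus documentation; the helper that builds a silent WAV exists only to exercise the check without corpus data.

```python
import io
import wave


def matches_greybeard_format(wav_file):
    """Return True if the WAV data is 2-channel, 8 kHz, 16-bit PCM.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as reader:
        return (
            reader.getnchannels() == 2          # one channel per call side
            and reader.getframerate() == 8000   # 8 kHz sample rate
            and reader.getsampwidth() == 2      # 2 bytes = 16-bit samples
        )


def make_silent_wav(channels, rate, sample_width, frames=80):
    """Build a short silent WAV in memory (test data only, not corpus audio)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(rate)
        writer.writeframes(b"\x00" * channels * sample_width * frames)
    buffer.seek(0)
    return buffer
```

A check of this kind could be run once after decompression to catch files that were converted with the wrong settings.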
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Spoken English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandschain, Linda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636460
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 932-371-960-415-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 was developed by
the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text
in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast conversation (BC) data collected by LDC
in 2006 and 2007 and transcribed by LDC or under its direction. *Data* This release
includes 21 source-translation document pairs, comprising 146,082 characters of Chinese
source text and its English translation. Data is drawn from seven distinct Chinese
programs broadcast in 2006 and 2007 from the following sources -- China Central TV,
a national and international broadcaster in Mainland China and Phoenix TV, a Hong
Kong-based satellite television station. Broadcast conversation programming is generally
more interactive than traditional news broadcasts and includes talk shows, interviews,
call-in programs and roundtable discussions. The programs in this release focus on
current events topics. The data was transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with Quick Rich Transcription guidelines
developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing
the text. Data was manually selected for translation according to several criteria,
including linguistic features, transcription features and topic features. The transcribed
and segmented files were then reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's Chinese to English
translation guidelines. Bilingual LDC staff performed quality control procedures on
the completed translations. Source data and translations are distributed in TDF format.
TDF files are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
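The tab-delimited layout described above can be read with a generic parser. The sketch below is a minimal illustration only: the four columns in the sample are hypothetical, since the actual field layout is defined in TDF_format.text, and the function name is an assumption. A real file would be opened with encoding="utf-8" before its text is passed in.

```python
import csv
import io


def read_tab_delimited(text):
    """Split tab-delimited text into rows of fields, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical two-segment sample; real TDF files define their own fields.
sample = "doc_a\t0.00\t2.50\t你好\n\ndoc_a\t2.50\t4.10\t谢谢\n"
rows = read_tab_delimited(sample)
```

Quoting is disabled so that quotation marks inside transcribed text are kept as ordinary characters rather than treated as field delimiters.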
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636479
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 021-129-973-518-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Manually Annotated Sub-Corpus Third Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Manually Annotated Sub-Corpus (MASC) Third Release was developed as part of The American
National Corpus project and consists of approximately 500,000 words of contemporary
American English written and spoken data annotated for a wide variety of linguistic
phenomena. The MASC project was established to address, to the extent possible, many
of the obstacles to the creation of large-scale, robust, multiply-annotated corpora
of English covering a wide range of genres of written and spoken language data. The
project provides appropriate data and annotations to serve as the base for a community-wide
annotation effort, together with an infrastructure that enables the incorporation
of contributed annotations into a single, usable format that can then be analyzed
as it is or transduced to any of a variety of other formats. The aim is to offset
some of the high costs of producing high quality linguistic annotations via a distribution
of effort and to solve some of the usability problems for annotations produced at
different sites by harmonizing their representation formats. It also provides data
from a much wider variety of genres than are often present in existing multiply-annotated
corpora of English, and all of the data in the corpus are drawn from current American
English so as to be most useful for natural language processing applications used
in the web-based environment. Further information about the project is available at
the MASC website. The source texts were drawn from the open portion of the American
National Corpus Second Release, which includes written texts and spoken transcripts
of American English from a broad range of genres produced since 1990 and from the
Language Understanding Annotation Corpus, a collection of various genres including
broadcast, newswire, email, and telephone speech annotated for committed belief, event
and entity coreference, dialog acts and temporal relations. MASC Third Release includes
the contents of MASC First Release (LDC2010T22) (82,000 words) which is also available
from LDC. There is no second release. *Data* All data in this release was annotated
for logical structure (paragraph, headings, etc.), token and sentence boundaries,
part of speech and lemma, shallow parse (noun and verb chunks) and named entities
(person, organization, location and date). Portions of the corpus were also annotated
for FrameNet frames (40k full text), Penn Treebank syntax (82k) and opinion (50k).
All annotations were either manually produced or hand-validated and represented in
ISO-GrAF standoff format. The original texts were derived from original electronic
versions in a wide variety of formats, including but not limited to QuarkXPress,
XML, Microsoft Word, Portable Document Format (PDF), HTML, and plain text. Transduction
procedures varied depending on the original format. As little correction or other
editorial modification as possible was applied to the text. Corrections to the text
were either made in standoff documents containing the corrected version or were reflected
in values of segmentation, token, sentence, or other segmental unit, and/or part of
speech annotation. The data are segmented into minimal regions spanning the primary
data. A minimal region is the smallest unit referenced by any of the tokenizations
applied to the data. Token annotations reference these regions as appropriate.
Sentences reference regions in primary data.
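The standoff idea described above, in which annotations point at regions of unmodified primary data, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ISO-GrAF XML representation used in the release: the class names, the sample sentence and the part-of-speech values are all chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A span of the primary text, given as character offsets [start, end)."""
    start: int
    end: int


@dataclass(frozen=True)
class Token:
    """An annotation that points at a region instead of copying its text."""
    region: Region
    pos: str  # part-of-speech label (illustrative values only)


def text_of(primary, region):
    """Resolve a region against the untouched primary data."""
    return primary[region.start:region.end]


primary = "Dogs bark."
tokens = [
    Token(Region(0, 4), "NNS"),
    Token(Region(5, 9), "VBP"),
    Token(Region(9, 10), "."),
]
words = [text_of(primary, t.region) for t in tokens]
```

Because the annotations hold only offsets, several annotation layers can refer to the same regions while the primary text stays unchanged.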
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- Geographic subdivision:
United States
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ide, Nancy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Suderman, Keith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Baker, Collin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Passonneau, Rebecca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fellbaum, Christiane
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636533
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 081-702-893-346-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LDC Spoken Language Sampler - Second Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC (Linguistic Data Consortium) Spoken Language Sampler - Second Release contains
samples from 20 different corpora published by LDC between 1996 and 2013. LDC distributes
a wide and growing assortment of resources for researchers, engineers and educators
whose work is concerned with human languages. Historically, most linguistic resources
were not generally available to interested researchers but were restricted to single
laboratories or to a limited number of users. Inspired by the success of selected
readily-available and well-known data sets, such as the Brown University text corpus,
LDC was founded in 1992 to provide a new mechanism for large-scale corpus development
and sharing of resources. With the support of its members, LDC is able to provide
critical services to the language research community. These services include: maintaining
the data archives, producing and distributing data via media or web downloads, negotiating
intellectual property agreements with potential information providers and maintaining
relations with other like-minded groups around the world. Resources available from
LDC include speech, text, video data and lexicons in multiple languages, as well as
software tools to facilitate the use of corpus materials. For a complete view of LDC's
publications, browse the Catalog. This sampler is available as a free download. *Data*
The LDC Spoken Language Sampler - Second Release provides speech and transcript samples
and is designed to illustrate the variety and breadth of the resources available from
the LDC Catalog. The sound files included in this release are excerpts that have been
modified in various ways relative to the original data as published by LDC: * Most
excerpts are truncated to be much shorter than the original files, typically between
1.5 and 2 minutes. * Signal amplitude has been adjusted where necessary to normalize
playback volume. * Some corpora are published in compressed form, but all samples
here are uncompressed. * Some text files are presented as images to ensure foreign
character sets display properly. * In some publications, NIST SPHERE file format is
used for audio data, but the audio files in this sampler are MS-WAV/audio (RIFF) file
format for compatibility with typical browser audio utilities. FLAC files have been
expanded into their wav form as well. The link for the catalog number takes you to
the catalog entry. LDC2013S05 Greybeard Greybeard is comprised of approximately 590
hours of English telephone conversation speech collected in October and November 2008
by LDC. The goal was to record new telephone conversations among subjects who had
participated in one or more previous LDC telephone collections, from Switchboard-1
(1991) through the Mixer studies (2006). LDC2013S04 GALE Phase 2 Chinese Broadcast
Conversation Speech GALE Phase 2 Chinese Broadcast Conversation Speech is comprised
of approximately 120 hours of Chinese speech from current events programming featuring
interviews, call-in programs and roundtable discussions. LDC2012S06 Turkish Broadcast
News Speech and Transcripts Turkish Broadcast News Speech and Transcripts contains
approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding
transcripts. LDC2012S05 USC-SFI MALACH Interviews and Transcripts English USC-SFI
MALACH Interviews and Transcripts English contains approximately 375 hours of interviews
from 784 survivors of the Holocaust along with transcripts and other documentation.
LDC2012S04 Malto Speech and Transcripts Malto Speech and Transcripts contains approximately
8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22
males, 5 females). Also included are accompanying transcripts, English translations
and glosses for 6 hours of the collection. Malto is principally spoken in northeastern
India and Bangladesh. LDC2012S03 Digital Archive of Southern Speech Digital Archive
of Southern Speech contains approximately 370 hours of American English speech data
from 30 female speakers and 34 male speakers, along with associated metadata about
the speakers and the recordings and maps in .jpeg format relating to the recording
locations in the southern United States. LDC2012S02 TORGO Database of Dysarthric Articulation
TORGO contains approximately 23 hours of English speech data, accompanying transcripts
and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic
lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control
group. LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set 2008 NIST Speaker
Recognition Evaluation Test Set contains 942 hours of multilingual telephone speech
and English interview speech along with transcripts and other materials used as test
data in the 2008 NIST Speaker Recognition Evaluation. LDC2010S05 Asian Elephant Vocalizations
Asian Elephant Vocalizations consists of 57.5 hours of audio recordings of vocalizations
by Asian Elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of
which 31.25 hours have been annotated. LDC2010S01 Fisher Spanish Speech Fisher Spanish
Speech consists of audio files covering roughly 163 hours of telephone speech from
136 native Caribbean Spanish and non-Caribbean Spanish speakers. LDC2007S18 CSLU Kids
Speech Developed at Oregon State University's Center for Spoken Language Understanding,
this corpus is a collection of spontaneous and prompted speech from 1100 children
from Kindergarten through Grade 10. LDC2007S15 Nationwide Speech Project A database
of speech representing regional accents and dialects of the United States. LDC2007S02
Fisher Levantine Arabic A collection of 279 Levantine Arabic telephone conversations
and transcripts from speakers of several nationalities. LDC2006S43 Gulf Arabic Conversational
Telephone Speech Contains 975 telephone conversations from speakers across the Persian
Gulf region and their transcriptions. LDC2004S09 NIST Meeting Pilot Corpus Speech
Collects speech and transcriptions from topical discussions in meeting settings including
complete descriptive metadata and detailed descriptions of the physical environment
in which the discussions took place. LDC2003S05 West Point Russian Speech Utterances
of sentences in Russian from 1,891 native and non-native speakers. LDC2003S03 Korean
Telephone Speech Collection of 100 telephone conversations between native Korean speakers
and their transcriptions. LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
Tone paradigms from Yemba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language
spoken by 300,000+ people in Southwestern Cameroon. LDC96S50 CALLFRIEND Farsi A corpus
of 60 unscripted telephone calls between friends and acquaintances speaking in their
native language, Farsi. LDC96S37 CALLHOME Japanese A corpus of 120 unscripted telephone
conversations between native Japanese speakers and a corpus of associated transcripts.
LANGUAGE NOTE
- Language note:
Content in . Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636509
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 533-010-199-301-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 was developed by
the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text
in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast conversation (BC) data collected by LDC
in 2005-2007 and transcribed by LDC or under its direction. LDC has also released
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 (LDC2013T11). *Data*
This release includes 20 source-translation document pairs, comprising 152,894 characters
of Chinese source text and its English translation. Data is drawn from six distinct
Chinese programs broadcast in 2005-2007 from Phoenix TV, a Hong Kong-based satellite
television station. Broadcast conversation programming is generally more interactive
than traditional news broadcasts and includes talk shows, interviews, call-in programs
and roundtable discussions. The programs in this release focus on current events topics.
The data was transcribed by LDC staff and/or transcription vendors under contract
to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the text. Data was manually
selected for translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented files were
then reformatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- General subdivision:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636517
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 222-157-378-519-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MADCAT Phase 3 Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase
3 Training Set contains all training data created by the Linguistic Data Consortium
(LDC) to support Phase 3 of the DARPA MADCAT Program. The data in this release consists
of handwritten Arabic documents, scanned at high resolution and annotated for the
physical coordinates of each line and token. Digital transcripts and English translations
of each document are also provided, with the various content and annotation layers
integrated in a single MADCAT XML output. The goal of the MADCAT program is to automatically
convert foreign text images into English transcripts. MADCAT Phase 3 data was collected
from Arabic source documents in three genres: newswire, weblog and newsgroup text.
Arabic speaking scribes copied documents by hand, following specific instructions
on writing style (fast, normal, careful), writing implement (pen, pencil) and paper
(lined, unlined). Prior to assignment, source documents were processed to optimize
their appearance for the handwriting task, which resulted in some original source
documents being broken into multiple pages for handwriting. Each resulting handwritten
page was assigned to up to five independent scribes, using different writing conditions.
The handwritten, transcribed documents were next checked for quality and completeness,
then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital
version of the handwritten document. The scanned images were then annotated to indicate
the physical coordinates of each line and token. Explicit reading order was also labeled,
along with any errors produced by the scribes when copying the text. The final step
was to produce a unified data format that takes multiple data streams and generates
a single MADCAT XML output file which contains all required information. The resulting
madcat.xml file contains distinct components: a text layer that consists of the source
text, tokenization and sentence segmentation, an image layer that consists of bounding
boxes, a scribe demographic layer that consists of scribe ID and partition (train/test)
and a document metadata layer. LDC has also released: * MADCAT Phase 1 Training Set
(LDC2012T15) * MADCAT Phase 2 Training Set (LDC2013T09) * MADCAT Chinese Pilot Training
Set (LDC2014T13) *Data* This release includes 4,540 annotation files in both GEDI
XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding
scanned image files in TIFF format. The annotation results in GEDI XML files include
ground truth annotations and source transcripts. Files are named as follows: * galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}
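The naming pattern above can be split back into its parts. The following is a minimal sketch, not an LDC tool: the function name, the tuple fields, and the sample file name are hypothetical, and it assumes that the page number and scribe ID never contain an underscore (the GALE ID may), so the name is split from the right.

```python
from typing import NamedTuple

# Suffixes named in the record; everything else here is an assumption.
SUFFIXES = (".gedi.xml", ".madcat.xml", ".tif")


class MadcatName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    kind: str  # "tif", "gedi.xml" or "madcat.xml"


def parse_madcat_name(filename: str) -> MadcatName:
    """Split galeID_page#_scribeID.{tif|gedi.xml|madcat.xml} into fields."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            kind = suffix[1:]
            break
    else:
        raise ValueError(f"unrecognized suffix: {filename!r}")

    # Split from the right so underscores inside the GALE ID are preserved.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected galeID_page_scribeID: {filename!r}")
    return MadcatName(parts[0], parts[1], parts[2], kind)


if __name__ == "__main__":
    # Hypothetical file name, used only to illustrate the pattern.
    print(parse_madcat_name("DOC_0001_3_S042.madcat.xml"))
```

Grouping the parsed names by (gale_id, page) would then collect the versions of one page written by different scribes.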
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Written Arabic
- General subdivision:
Data processing.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- General subdivision:
Translating into English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doermann, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636541
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 889-136-751-435-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by the Linguistic
Data Consortium (LDC) and comprises approximately 128 hours of Arabic broadcast
conversation speech collected in 2007 by LDC, Medianet (Tunis, Tunisia) and MTC (Rabat,
Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast Conversation
Transcripts Part 2 (LDC2013T17). Broadcast audio for the GALE program was collected
at LDC's Philadelphia, PA USA facilities and at three remote collection sites: Hong
Kong University of Science and Technology, Hong Kong (Chinese),
Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined
local and outsourced broadcast collection supported GALE at a rate of approximately
300 hours per week of programming from more than 50 broadcast sources for a total
of over 30,000 hours of collected broadcast audio over the life of the program. LDC's
local broadcast collection system is highly automated, easily extensible and robust
and capable of collecting, processing and evaluating hundreds of hours of content
from several dozen sources per day. The broadcast material is served to the system
by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems
(DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television
(CATV) feeds. The mapping between receivers and recorders is dynamic and modular.
All signal routing is performed under computer control, using a 256x64 A/V matrix
switch. Programs are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized
closed captions (in the case of North American English) and to generate automatic
speech recognition (ASR) output. An overview of the system, the sources recorded and
the configuration of the recording laboratory are contained in the Guidelines for
Broadcast Audio Collection Version 3.0 included in this release. LDC designed a portable
platform for remote broadcast collection. This is a TiVo-style digital video recording
(DVR) system that records two streams of A/V material simultaneously. It supports
analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside
of the United States. It has a small footprint, weighs less than 30 pounds and can
be transported as carry-on luggage. Medianet collected Arabic programming from across
the Gulf region using its internal system and LDC's portable broadcast collection
platform installed in 2008. The portable platform deployed at the Medianet Tunisian
collection facility collected multiple streams of regional Arabic programming from
various sources. MTC collected Arabic programming using its internal collection system.
LDC has also released GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 (LDC2013S02)
and GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (LDC2013T04). *Data*
The broadcast conversation recordings in this release feature interviews, call-in
programs and roundtable discussions focusing principally on current events from the
following sources: Abu Dhabi TV, based in Abu Dhabi, United Arab Emirates; Al Alam
News Channel, based in Iran; Al Arabiya, a news television station based in Dubai;
Aljazeera, a regional broadcaster located in Doha, Qatar; Lebanese Broadcasting Corporation,
a Lebanese television station; Oman TV, a national broadcaster located in the Sultanate
of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria
TV, the national television station in Syria. A table showing the number of programs
and hours recorded from each source is contained in the readme file. This release
contains 141 audio files presented in FLAC-compressed Waveform Audio File format (.flac),
16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker
following Audit Procedure Specification Version 2.0 which is included in this release.
The broadcast auditing process served three principal goals: as a check on the operation
of the broadcast collection system equipment by identifying failed, incomplete or
faulty recordings; as an indicator of broadcast schedule changes by identifying instances
when the incorrect program was recorded; and as a guide for data selection by retaining
information about a program's genre, data type and topic.
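The audio format stated above (16000 Hz, single-channel, 16-bit PCM) fixes the uncompressed data rate, which gives a rough upper bound on storage before FLAC compression. The sketch below is only an illustration of that arithmetic; the function name is hypothetical, and the actual size of the FLAC files on disk is not stated in this record.

```python
def pcm_bytes(hours: float,
              sample_rate_hz: int = 16000,
              bits_per_sample: int = 16,
              channels: int = 1) -> int:
    """Uncompressed PCM size in bytes for the given duration."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return round(hours * 3600 * bytes_per_second)


if __name__ == "__main__":
    # 16000 samples/s * 2 bytes * 1 channel = 32000 bytes per second.
    print(pcm_bytes(1))    # one hour of audio
    print(pcm_bytes(128))  # approximate total duration of this release
```

One hour works out to 115,200,000 bytes, so the approximately 128 hours in this release correspond to about 14.7 GB of uncompressed PCM.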
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 522-278-791-158-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 128
hours of Arabic broadcast conversation speech collected in 2007 by LDC, Medianet
(Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
2 Arabic Broadcast Conversation Speech Part 2 (LDC2013S07). The source broadcast conversation
recordings feature interviews, call-in programs and round table discussions focusing
principally on current events from the following sources: Abu Dhabi TV (based in Abu
Dhabi, United Arab Emirates), Al Alam News Channel (based in Iran), Al Arabiya (a
news television station based in Dubai), Aljazeera (a regional broadcaster located
in Doha, Qatar), Lebanese Broadcasting Corporation (a Lebanese television station),
Oman TV (a national broadcaster located in the Sultanate of Oman), Saudi TV (a national
television station based in Saudi Arabia) and Syria TV, the national television station
in Syria. *Data* The transcript files are in plain-text, tab-delimited format (TDF)
with UTF-8 encoding, and the transcribed data totals 763,945 tokens. The transcripts
were created with the LDC-developed transcription tool, XTrans, a multi-platform,
multilingual, multi-channel transcription tool that supports manual transcription
and annotation of audio recordings. XTrans is available from the following link, http://www.ldc.upenn.edu/tools/XTrans/downloads/.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR) both of which are included in the
documentation with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal additional mark-up.
It does not include sentence unit annotation. QRTR annotation adds structural information
such as topic boundaries and manual sentence unit annotation to the core components
of a quick transcript. Files with QTR as part of the filename were developed using
QTR transcription. Files with QRTR in the filename indicate QRTR transcription. LDC
has also released GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 (LDC2013S02)
and GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (LDC2013T04).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636568
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 857-492-590-583-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Semantic Textual Similarity (STS) 2013 Machine Translation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of
the STS 2013 Shared Task which was held in conjunction with *SEM 2013, the second
joint conference on lexical and computational semantics organized by the ACL (Association
for Computational Linguistics) interest groups SIGLEX and SIGSEM. It is comprised of
one text file containing 750 English sentence pairs translated from the Arabic and
Chinese newswire and web data sources. The goal of the Semantic Textual Similarity
(STS) task was to create a unified framework for the evaluation of semantic textual
similarity modules and to characterize their impact on natural language processing
(NLP) applications. STS measures the degree of semantic equivalence. The STS task
was proposed as an attempt at creating a unified framework that allows for an extrinsic
evaluation of multiple semantic components that otherwise have historically tended
to be evaluated independently and without characterization of impact on NLP applications.
More information is available at the STS 2013 Shared Task homepage. *Data* The source
data is Arabic and Chinese newswire and web data collected by LDC that was translated
and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in
several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150
pairs are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open
Machine Translation (OpenMT) Progress Test Sets (LDC2013T07). The data was built to
identify semantic textual similarity between two short text passages. The corpus is
comprised of two tab-delimited sentences per line. The first sentence is a translation
and the second sentence is a post-edited translation. Post-editing is a process to
improve machine translation with a minimum of manual labor. The gold standard similarity
values and other STS datasets can be obtained from the STS homepage, linked above.
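The line layout described above (two tab-delimited sentences per line, translation first and post-edited translation second) can be read with a few lines of code. This is a minimal sketch under assumptions: the function name and the sample sentences are hypothetical, and UTF-8 text with exactly one tab per non-empty line is assumed rather than documented here.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_sentence_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (translation, post_edited) pairs from tab-delimited lines."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        yield fields[0], fields[1]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the corpus.
    sample = io.StringIO(
        "He go to market yesterday.\tHe went to the market yesterday.\n"
    )
    for translation, post_edited in read_sentence_pairs(sample):
        print(translation, "|", post_edited)
```

To read the real file, an open text handle (for example one created with `open(path, encoding="utf-8")`) can be passed in place of the `StringIO` object.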
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Agirre, Eneko
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cer, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Diab, Mona
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gonzalez-Agirre, Aitor
ADDED ENTRY--PERSONAL NAME
- Personal name:
Guo, Weiwei
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636576
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 097-396-657-690-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast News Transcripts was developed by the Linguistic Data
Consortium (LDC) and contains transcriptions of approximately 110 hours of Chinese
broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science
and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding audio data is released as GALE Phase
2 Chinese Broadcast News Speech (LDC2013S08). The source broadcast recordings feature
news programs focusing principally on current events from the following sources: Anhui
TV, a regional television station in Mainland China, Anhui Province; China Central
TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix
TV, a Hong Kong-based satellite television station. *Data* The transcript files are
in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 1,593,049 tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription
tool that supports manual transcription and annotation of audio recordings. XTrans
is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed the LDC quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR) both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal additional mark-up.
It does not include sentence unit annotation. QRTR annotation adds structural information
such as topic boundaries and manual sentence unit annotation to the core components
of a quick transcript. Files with QTR as part of the filename were developed using
QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
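The filename convention above can be used to sort transcript files by transcription style. The following is a minimal sketch: the function name and the sample file names are hypothetical, and case-insensitive matching is an assumption, since the record does not say how the marker is capitalized in the actual file names.

```python
from typing import Optional


def transcription_style(filename: str) -> Optional[str]:
    """Return "QRTR", "QTR", or None based on the marker in the file name."""
    upper = filename.upper()
    # "QTR" is not a substring of "QRTR", so the two markers cannot be confused;
    # the richer style is still checked first for clarity.
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return None


if __name__ == "__main__":
    # Hypothetical file names, used only to illustrate the convention.
    for name in ("program_a.qtr.tdf", "program_b.qrtr.tdf", "readme.txt"):
        print(name, "->", transcription_style(name))
```

Files classified as QRTR carry the manual sentence unit annotation and topic boundaries; files classified as QTR do not.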
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636584
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 398-416-893-097-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium
(LDC) and is comprised of approximately 126 hours of Mandarin Chinese broadcast news
speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and Hong
Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA
GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts
are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20). Broadcast
audio for the GALE program was collected at the Philadelphia, PA USA facilities of
LDC and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. The LDC local broadcast collection system
is highly automated, easily extensible and robust and capable of collecting, processing
and evaluating hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast
satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular. All signal routing is performed under
computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high
bandwidth A/V format and are then processed to extract audio, to generate keyframes
and compressed audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition (ASR) output.
An overview of the system, the sources recorded and the configuration of the recording
laboratory are contained in the Guidelines for Broadcast Audio Collection Version
3.0 included in this release. LDC designed a portable platform for remote broadcast
collection. This is a TiVo-style digital video recording (DVR) system that records
two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL)
and FTA DVB-S satellite programming and can operate outside of the United States.
It has a small footprint, weighs less than 30 pounds and can be transported as carry-on
luggage. HKUST collected Chinese broadcast programming using its internal recording
system and a portable broadcast collection platform designed by LDC and installed
at HKUST in 2006. *Data* The broadcast recordings in this release feature news programs
focusing principally on current events from the following sources: Anhui TV, a regional
television station in Mainland China, Anhui Province; China Central TV (CCTV), a national
and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based
satellite television station. This release contains 248 audio files presented in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file
was audited by a native Chinese speaker following Audit Procedure Specification Version
2.0 which is included in this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast collection system equipment
by identifying failed, incomplete or faulty recordings; as an indicator of broadcast
schedule changes by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about program genre, data
type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Spoken Chinese
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2013 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636592
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2013T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 151-738-649-048-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
OntoNotes Release 5.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2013]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2013T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative
effort between BBN Technologies, the University of Colorado, the University of Pennsylvania
and the University of Southern California's Information Sciences Institute. The goal
of the project was to annotate a large corpus comprising various genres of text (news,
conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows)
in three languages (English, Chinese, and Arabic) with structural information (syntax
and predicate argument structure) and shallow semantics (word sense linked to an ontology
and coreference). OntoNotes Release 5.0 contains the content of earlier releases --
OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release
3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or
additional annotations for, newswire (News), broadcast news (BN), broadcast conversation
(BC), telephone conversation (Tele) and web data (Web) in English and Chinese and
newswire data in Arabic. Also contained is English pivot text (Old Testament and New
Testament text). This cumulative publication consists of 2.9 million words with counts
shown in the table below.
        Arabic  English  Chinese
News    300k    625k     250k
BN      n/a     200k     250k
BC      n/a     200k     150k
Web     n/a     300k     150k
Tele    n/a     120k     100k
Pivot   n/a     n/a      300
The OntoNotes
project built on two time-tested resources, following the Penn Treebank for syntax
and the Penn PropBank for predicate-argument structure. Its semantic representation
includes word sense disambiguation for nouns and verbs, with some word senses connected
to an ontology, and coreference. *Data* Documents describing the annotation guidelines
and the routines for deriving various views of the data from the database are included
in the documentation directory of this release. The annotation is provided both in
separate text files for each annotation layer (Treebank, PropBank, word sense, etc.)
and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with
a Python API to provide convenient cross-layer access. It is a known issue that this
release contains some non-validating XML files. The included tools, however, use a
non-validating XML parser to parse the .xml files and load the appropriate values.
*Tools* This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble
the database from the original annotation files. It can be found in the directory
tools/ontonotes-db-tool-v0.999b. This tool can be used to derive various views of
the data from the database, and it provides an API that can implement new queries
or views. Licensing information for the OntoNotes DB Tool package is included in its
source directory.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Arabic, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Computational linguistics
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Weischedel, Ralph
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitchell
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hovy, Eduard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ramshaw, Lance
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaufman, Jeff
ADDED ENTRY--PERSONAL NAME
- Personal name:
Franchini, Michelle
ADDED ENTRY--PERSONAL NAME
- Personal name:
El-Bachouti, Mohammed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Belvin, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Houston, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2013T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u per d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636673
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 642-638-339-584-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
pes
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Farsi Second Edition Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CALLFRIEND Farsi Second Edition Transcripts was developed by the Linguistic Data Consortium
(LDC) and consists of transcripts for approximately 42 hours of telephone conversation
(100 recordings) among native Farsi speakers. The calls were recorded in 1995 and
1996 as part of the CALLFRIEND collection, a project designed primarily to support
research in automatic language identification. One hundred native Farsi speakers living
in the continental United States each made a single telephone call, lasting up to 30 minutes,
to a family member or friend living in the United States. Corresponding speech data
is available as CALLFRIEND Farsi Second Edition Speech (LDC2014S01). *Data* Transcripts
are presented in three formats: romanized transcripts (*asc.txt), Arabic-script transcripts
(*ntv.txt) and both romanized and Arabic forms in a simple XML format (*.xml). For
the *.txt files, the four main fields on each line (start-offset, end-offset, speaker-label,
transcript-text) are separated by tabs. Each file begins with a single comment line
containing the file_id string. This is followed immediately by the list of time-stamped
segments, in order according to their start-offset values, with no blank lines. The
XML form of the transcripts contains both Arabicized and romanized forms for Farsi
words.
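The *.txt layout described above (one leading comment line, then tab-separated segments) can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with the corpus: the names Segment and parse_transcript are hypothetical, the offsets are assumed to be decimal numbers, and the sample text is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # start-offset (assumed to be a decimal number)
    end: float     # end-offset (assumed to be a decimal number)
    speaker: str   # speaker-label
    text: str      # transcript-text


def parse_transcript(content: str) -> List[Segment]:
    """Parse the text of one *.txt transcript into a list of segments."""
    lines = content.splitlines()
    segments = []
    # The first line is the comment line holding the file_id, so skip it.
    for line in lines[1:]:
        if not line.strip():
            continue  # the format has no blank lines, but tolerate a stray one
        # Split on the first three tabs only, so the transcript text stays whole.
        start, end, speaker, text = line.split("\t", 3)
        segments.append(Segment(float(start), float(end), speaker, text))
    return segments


# Invented sample in the described layout (not taken from the corpus).
sample = "file_id line\n0.50\t2.10\tA\tsalaam\n2.10\t3.75\tB\tsalaam chetori\n"
for seg in parse_transcript(sample):
    print(seg.speaker, seg.start, seg.end, seg.text)
```

Because the segments are already ordered by start-offset, the returned list can be used directly for alignment against the audio without sorting.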
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Iranian Persian. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Persian language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miller, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cieri, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u per d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636657
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 639-168-803-360-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
pes
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CALLFRIEND Farsi Second Edition Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CALLFRIEND Farsi Second Edition Speech was developed by the Linguistic Data Consortium
(LDC) and consists of approximately 42 hours of telephone conversation (100 recordings)
among native Farsi speakers. The calls were recorded in 1995 and 1996 as part of the
CALLFRIEND collection, a project designed primarily to support research in automatic
language identification. One hundred native Farsi speakers living in the continental
United States each made a single telephone call, lasting up to 30 minutes, to a family
member or friend living in the United States. This release represents all calls from
the collection. LDC released recordings from 60 calls without transcripts in 1996
as CALLFRIEND Farsi (LDC96S50) after 20 of those calls were used as evaluation data
in the first NIST Language Recognition Evaluation (LRE). Seven of these original 60
calls were deemed unsuitable for transcription, and thus 53 of the original CF Farsi
files are included along with 47 new files. Corresponding transcripts are available
in CALLFRIEND Farsi Second Edition Transcripts (LDC2014T01). *Data* All recordings
involved domestic calls routed through the automated telephone collection platform
at LDC and were stored as 2-channel (4-wire), 8-kHz mu-law samples taken directly
from the public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed
MS-WAV (RIFF) format audio file containing 2-channel, 8-kHz, 16-bit PCM sample data.
This release includes speaker information, including gender, the number of speakers
on each channel and call duration.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Iranian Persian. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Persian language
- Form subdivision:
Databases.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Persian language
- Form subdivision:
Databases.
- General subdivision:
Spoken Persian
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Automatic speech recognition
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Canavan, Alexandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zipperlen, George
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636665
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 381-030-411-602-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 was developed
by the Linguistic Data Consortium (LDC) and contains 141,058 tokens of word aligned
Arabic and English parallel text with treebank annotations. This material was used
as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological and syntactic
structures aligned at the sentence level and the sub-sentence level. Such data sets
are useful for natural language processing and related fields, including automatic
word alignment system training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic
studies. With respect to machine translation system development, parallel aligned
treebanks may improve system performance with enhanced syntactic parsers, better rules
and knowledge about language pairs and reduced word error rate. The source Arabic
data was translated into English. Arabic and English treebank annotations were performed
independently. The parallel texts were then word aligned. The material in this corpus
corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast
News v1.0 (LDC2012T07). LDC previously released GALE Arabic-English Parallel Aligned
Treebank -- Broadcast News Part 1 (LDC2013T14). *Data* The source data consists of
Arabic broadcast news programming collected by LDC in 2007 and 2008 from Al Arabiya,
Abu Dhabi TV, Al Baghdadya TV, Al Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah,
Al Sharqiya, Dubai TV, Oman TV, Radio Sawa and Saudi TV. All data is encoded as UTF-8.
A count of files, words, tokens and segments is below.
Language  Files  Words    Tokens   Segments
Arabic    31     110,690  141,058  7,102
Note: Word count is based on the untokenized
Arabic source. Token count is based on the ATB-tokenized Arabic source. The purpose
of the GALE word alignment task was to find correspondences between words, phrases
or groups of words in a set of parallel texts. Arabic-English word alignment annotation
consisted of the following tasks:
* Identifying different types of links: translated (correct or incorrect) and not
  translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g., blank segments,
  incorrectly-segmented segments, segments with foreign languages
* Tagging unmatched words attached to other words or phrases
This release contains four types of files - raw, tokenized, treebank,
and wa. The raw format contains the original Arabic and English sentences without
any annotation. The tokenized format is the treebank tokenized version of the raw
data which may contain Empty Category tokens (treebank leaves that have the POS label
-NONE-). The treebank and wa files are treebank and word alignment annotations on
the tokenized files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u per d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636681
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 847-333-922-514-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
pes
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source was
developed by NIST Multimodal Information Group. This release contains the evaluation
sets (source data and human reference translations), DTD, scoring software, and evaluation
plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English
on a parallel data set. The set is based on a subset of the Arabic-to-English and
Chinese-to-English progress tests from the OpenMT 2008, 2009 and 2012 evaluations
with new source data created by humans based on the English reference translation.
The package was compiled, and scoring software was developed, at NIST, making use
of newswire and web data and reference translations developed by the Linguistic Data
Consortium (LDC) and the Defense Language Institute Foreign Language Center. The objective
of the OpenMT evaluation series is to support research in, and help advance the state
of the art of, machine translation (MT) technologies -- technologies that translate
text between human languages. Input may include all forms of text. The goal is for
the output to be an adequate and fluent translation of the original. The MT evaluation
series started in 2001 as part of the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been
driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important
contribution to the direction of research efforts and the calibration of technical
capabilities in MT. The Open MT evaluations are intended to be of interest to all
researchers working on the general problem of automatic translation between human
languages. To this end, they are designed to be simple, to focus on core technology
issues and to be fully supported. The 2012 task included the evaluation of five language
pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and
Korean-to-English in two source data styles. For general information about the NIST
OpenMT evaluations, refer to the NIST OpenMT website. This evaluation kit includes
a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality
score for one (or more) MT systems. The script works by comparing the system output
translation with a set of (expert) reference translations of the same source text.
Comparison is based on finding sequences of words in the reference translations that
match word sequences in the system output translation. LDC has also released the following
related corpora: NIST 2012 Open Machine Translation (OpenMT) Evaluation (LDC2013T03)
(material from the Chinese-to-English pair track including restricted domain data)
and NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07)
(Arabic, Chinese and English test data). *Data* This release consists of 20 files,
four for each of the five languages, presented in XML with an included DTD. The four
files are source and reference data from the same source data in the following two
styles:
* English-true: an English-oriented translation; this requires that the text read
  well and not use any idiomatic expressions in the foreign language to convey
  meaning, unless absolutely necessary.
* Foreign-true: a translation as close as possible to the foreign language, as if
  the text had originated in that language.
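The summary describes the scoring script as matching word sequences in the system output against word sequences in the reference translations. The sketch below illustrates that general idea with a clipped n-gram precision in Python. It is not the code of mteval-v13a.pl and does not reproduce its tokenization or final score; the function names and the example sentences are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive words in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(hypothesis: str, references: List[str], n: int) -> float:
    """Fraction of hypothesis n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference where it is most frequent (clipping).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # Largest count of each n-gram over all reference translations.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Invented example: one system output and one reference translation.
system_output = "the cat sat on the mat"
reference_set = ["the cat is on the mat"]
print(clipped_ngram_precision(system_output, reference_set, 1))
print(clipped_ngram_precision(system_output, reference_set, 2))
```

Clipping keeps a system from being rewarded for repeating a matching word more often than any reference uses it.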
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Dari, Korean, Persian, English, Mandarin Chinese, Arabic, Iranian Persian,
and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Dari language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Korean language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Persian language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
NIST Multimodal Information Group
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 789-673-729-277-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
King Saud University Arabic Speech Database
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
King Saud University Arabic Speech Database was developed by Speech Group (SG) at
King Saud University and contains 590 hours of recorded Arabic speech from 269 male
and female speakers. The utterances include read and spontaneous speech. The recordings
were conducted in varied environments representing quiet and noisy settings. *Data*
The corpus was designed principally for speaker recognition research. Other possible
applications include first-language recognition and the study of mobile-phone effects,
multichannel effects, and the use of different types of microphones. The speech sources
are word lists, sentence lists, paragraphs, and question-and-answer sessions. Read speech
text includes the following:
* Sets of sentences devised to cover allophones of each phoneme, phonetic balance, and
differentiation of accents.
* Word lists developed to minimize missing phonemes and to represent nasals, fricatives,
commonly used words, and numbers.
* Two paragraphs selected because they included all letters of the alphabet and were
easy to read.
Spontaneous speech was captured through question-and-answer sessions in which speakers
answered questions displayed on a screen. The questions were on general topics such as
the weather and food and included the speaker's name or number. The speakers
were Saudis and non-Saudis. Among the non-Saudi participants were Arabs and non-Arabs.
All female speakers were either Saudis or non-Saudi Arabs. Male speakers included
non-Arabs from the Indian subcontinent, Africa, Southeast Asia and Eastern Europe. Non-Arab
participants were required to be able to read Arabic at an acceptable level. Most
of the non-Arab speakers were from the fourth level in the Arabic Linguistics Institute
at King Saud University. The non-Saudi participants represented 28 nationalities and
were chosen from clusters of areas or countries. Each speaker was recorded in three
different environments: in a soundproof room, in an office and in a cafeteria. The
recordings were collected via different microphones and a mobile phone and averaged
between 16 and 19 minutes. The recordings were made in three sessions with a gap
of approximately 6 weeks between sessions. The data was verified for missing recordings,
problems with the recording system or errors in the recording process. All files are
presented as two-channel, 48 kHz, 16-bit, FLAC-compressed PCM WAV files. Note that sizes
and file names in the documentation are for the uncompressed WAV files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alsulaiman, Mansour
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muhammad, Ghulam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Abdelkader, Bencherif Mohamed
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mahmood, Awais
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ali, Zulfiqar
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636703
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 820-520-434-845-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast news (BN) data collected by LDC between
2005 and 2007 and transcribed by LDC or under its direction. *Data* This release includes
30 source-translation document pairs, comprising 198,350 characters of translated
material. Data is drawn from 11 distinct Chinese BN programs broadcast by China Central
TV, a national and international broadcaster in Mainland China; Jiangsu TV, a regional
television station in Jiangsu Province, Mainland China; New Tang Dynasty TV, a broadcaster
based in the United States; and Phoenix TV, a Hong Kong-based satellite television
station. The broadcast news recordings in this release focus principally on current
events. The data was transcribed by LDC staff and/or transcription vendors under contract
to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the text. Data was manually
selected for translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented files were
then reformatted into a human-readable translation format and assigned to translation
vendors. Translators followed the Chinese to English translation guidelines developed
by LDC. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
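As a rough illustration of the tab-delimited layout described above, the following Python sketch splits each line of a TDF-style file into its fields. The sample lines, the number of columns, and the helper name `read_tab_delimited` are invented for illustration; the actual field order is defined in the release's own format documentation and is not assumed here.

```python
import io


def read_tab_delimited(stream):
    """Yield each non-blank line of a text stream as a list of tab-separated fields."""
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        yield line.split("\t")


# Hypothetical two-segment sample; real TDF columns are defined in the release documentation.
sample = io.StringIO(
    "example_file\t0\t12.5\tspeaker1\t第一句\n"
    "example_file\t0\t15.0\tspeaker2\t第二句\n"
)
rows = list(read_tab_delimited(sample))
for fields in rows:
    # Print the field count and the last field (the segment text in this made-up sample).
    print(len(fields), fields[-1])
```

For a file on disk, the same function could be given a handle opened with `open(path, encoding="utf-8")`, since the release is stated to be UTF-8 encoded.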
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Machine translating
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Chinese language
- Form subdivision:
Databases.
- General subdivision:
Translating into English
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Machine translating
- Form subdivision:
Databases.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636711
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 642-473-657-451-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web was developed
by the Linguistic Data Consortium (LDC) and contains 344,680 tokens of word-aligned
Arabic and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
(LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
*Data* This release consists of Arabic source newswire and web data collected by LDC
in 2006-2008. The distribution by genre, words, character tokens and segments appears
below:
Language  Genre  Docs  Words    CharTokens  Segments
Arabic    WB     119   59,696   81,620      4,383
Arabic    NW     717   198,621  263,060     8,423
Note that word count is based on the untokenized Arabic source, and token count is based
on the tokenized Arabic source. The Arabic word alignment tasks consisted of the following
components:
* Normalizing tokenized tokens as needed
* Identifying different types of links
* Identifying sentence segments not suitable for annotation
* Tagging unmatched words attached to other words or phrases
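The per-genre figures in the table above can be cross-checked against the 344,680-token total quoted at the start of the summary. The short Python sketch below sums the two rows; the dictionary layout and variable names are illustrative only.

```python
# Per-genre counts copied from the table above: (docs, words, char_tokens, segments).
counts = {
    "WB": (119, 59_696, 81_620, 4_383),
    "NW": (717, 198_621, 263_060, 8_423),
}

# Sum each column across the two genres.
total_docs, total_words, total_tokens, total_segments = (
    sum(column) for column in zip(*counts.values())
)

print(total_docs, total_words, total_tokens, total_segments)
```

The token total produced this way matches the 344,680 tokens stated for the release.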
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
Arabic language
- Form subdivision:
Databases.
- General subdivision:
Data processing
SUBJECT ADDED ENTRY--TOPICAL TERM
- Topical term or geographic name as entry element:
English language
- General subdivision:
Data processing.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636754
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 640-546-772-297-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ETS Corpus of Non-Native Written English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ETS Corpus of Non-Native Written English was developed by Educational Testing Service
and comprises 12,100 English essays written by speakers of 11 non-English native
languages as part of an international test of academic English proficiency, TOEFL
(Test of English as a Foreign Language). The test includes reading, writing, listening,
and speaking sections and is delivered by computer in a secure test center. This release
contains 1,100 essays for each of the 11 native languages sampled from eight topics
with information about the score level (low/medium/high) for each essay. The corpus
was developed with the specific task of native language identification in mind, but
is likely to support tasks and studies in the educational domain, including grammatical
error detection and correction and automatic essay scoring, in addition to a broad
range of research studies in the fields of natural language processing and corpus
linguistics. For the task of native language identification, the following division
is recommended: 82% as training data, 9% as development data and 9% as test data,
split according to the file IDs accompanying the data set. *Data* The data is sampled
from essays written in 2006 and 2007 by test takers whose native languages were Arabic,
Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish.
The essays are provided in both original (raw) and tokenized forms as UTF-8 encoded
text files. Also included are the prompts (topics) for the essays
and metadata about the test takers' proficiency level.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Blanchard, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tetreault, Joel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Higgins, Derrick
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cahill, Aoife
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chodorow, Martin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636738
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 382-492-972-333-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Domain-Specific Hyponym Relations
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Domain-Specific Hyponym Relations was developed by the Shaanxi Province Key Laboratory
of Satellite and Terrestrial Network Technology at Xi'an Jiaotong University, Xi'an,
Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains
including data mining, computer networks, data structures, Euclidean geometry and
microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.
A hyponym relation is a word sense relation that is an IS-A relation. For example,
dog is a hyponym of animal and binary tree is a hyponym of tree structure. Among the
applications for domain-specific hyponym relations are taxonomy and ontology learning,
query result organization in faceted search, and knowledge organization and automated
reasoning in knowledge-rich applications. *Data* The data is presented in XML format,
and each file provides hyponym relations in one domain. Within each file, the term,
Wikipedia URL, hyponym relation and the names of the hyponym and hypernym words are
included. The distribution of terms and relations is set forth in the table below:
Dataset             Terms  Hyponym Relations
Data Mining         278    364
Computer Network    336    399
Data Structure      315    578
Euclidean Geometry  455    690
Microbiology        1,028  3,533
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wei, Bifan
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wang, Chenchen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u cze d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 310-213-848-753-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
USC-SFI MALACH Interviews and Transcripts Czech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of
Southern California Shoah Foundation Institute (USC-SFI) and the University of West
Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project.
It contains approximately 229 hours of interviews from 420 interviewees along with
transcripts and other documentation. Inspired by his experience making Schindler's
List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation
in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust.
While most of those who gave testimony were Jewish survivors, the Foundation also
interviewed homosexual survivors, Jehovah's Witness survivors, liberators and liberation
witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy)
survivors, survivors of eugenics policies, and war crimes trials participants. Within
several years, the Visual History Archive held nearly 52,000 video testimonies in
32 languages representing 56 countries. It is the largest archive of its kind in the
world. In 2006, the Foundation became part of the Dana and David Dornsife College
of Letters, Arts and Sciences at the University of Southern California in Los Angeles
and was renamed as the USC Shoah Foundation Institute for Visual History and Education.
The goal of the MALACH project was to develop methods for improved access to large
multinational spoken archives. The focus was advancing the state of the art of automatic
speech recognition and information retrieval. The characteristics of the USC-SFI collection
-- unconstrained, natural speech filled with disfluencies, heavy accents, age-related
coarticulations, un-cued speaker and language switching and emotional speech -- were
considered well-suited for that task. The work centered on five languages: English,
Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts Czech
was developed for the Czech speech recognition experiments. LDC has also released
USC-SFI MALACH Interviews and Transcripts English (LDC2012S05). *Data* The speech
data in this release was collected beginning in 1994 under a wide variety of conditions
ranging from quiet to noisy (e.g., airplane overflights, wind noise, background conversations
and highway noise). Original interviews were recorded on Sony Beta SP tapes, then
digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound
files in this release are in single-channel, FLAC-compressed PCM WAV format at a sampling
frequency of 16 kHz. Approximately 570 of the interviews collected by USC-SFI are in
Czech, averaging approximately 2.25 hours each. The interview sessions in this release
are divided into a training set (400 interviews) and a test set (20 interviews). The
first fifteen minutes of the second tape from each training interview (approximately
30 total minutes of speech) were transcribed in .trs format using Transcriber 1.5.1.
The test interviews were transcribed completely. Thus the corpus consists of 229 hours
of speech (186 hours of training material plus 43 hours of test data) with 143 hours
transcribed (100 hours of training material plus 43 hours of test data). Certain interviews
include speech from family members in addition to that of the subject and the interviewer.
Accordingly, the corpus contains speech from more than 420 speakers, who are more
or less equally distributed between males and females.
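The hour totals quoted above are internally consistent, which a few lines of Python can confirm. All figures are taken from the summary; the variable names are illustrative.

```python
# Figures quoted in the summary, in hours.
training_speech, test_speech = 186, 43
training_transcribed, test_transcribed = 100, 43

# 400 training interviews, with the first 15 minutes of one tape transcribed for each.
training_interviews = 400
transcribed_minutes_each = 15

total_speech = training_speech + test_speech
total_transcribed = training_transcribed + test_transcribed
expected_training_transcribed = training_interviews * transcribed_minutes_each / 60

print(total_speech, total_transcribed, expected_training_transcribed)
```

The sums reproduce the stated 229 hours of speech and 143 transcribed hours, and 400 interviews at 15 minutes each account for the 100 transcribed training hours.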
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef V.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Psutka, Josef
ADDED ENTRY--PERSONAL NAME
- Personal name:
Radová, Vlasta
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ircing, Pavel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matoušek, Jindřich
ADDED ENTRY--PERSONAL NAME
- Personal name:
Müller, Luděk
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636762
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 826-696-438-989-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Parallel Aligned Treebank -- Web Training
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Parallel Aligned Treebank -- Web Training was developed by the
Linguistic Data Consortium (LDC) and contains 69,766 tokens of word-aligned Arabic
and English parallel text with treebank annotations. This material was used as training
data in the DARPA GALE (Global Autonomous Language Exploitation) program. Parallel
aligned treebanks are treebanks annotated with morphological and syntactic structures
aligned at the sentence level and the sub-sentence level. Such data sets are useful
for natural language processing and related fields, including automatic word alignment
system training and evaluation, transfer-rule extraction, word sense disambiguation,
translation lexicon extraction and cultural heritage and cross-linguistic studies.
With respect to machine translation system development, parallel aligned treebanks
may improve system performance with enhanced syntactic parsers, better rules and knowledge
about language pairs and reduced word error rate. In this release, the source Arabic
data was translated into English. Arabic and English treebank annotations were performed
independently. The parallel texts were then word aligned. LDC previously released
Arabic-English Parallel Aligned Treebanks as follows:
* Newswire
* Broadcast News Part 1
* Broadcast News Part 2
*Data* This release consists of Arabic source web data (newsgroups, weblogs) collected
by LDC in 2004 and 2005. All data is encoded as UTF-8. A count of files, words, tokens
and segments is below.
Language  Files  Words   Tokens  Segments
Arabic    162    46,710  69,766  3,178
Note: Word count is based on the untokenized Arabic source; token count is based on the
ATB-tokenized Arabic source. The purpose of the GALE word alignment task was to find
correspondences between words, phrases or groups of words in a set of parallel texts.
Arabic-English word alignment annotation consisted of the following tasks:
* Identifying different types of links: translated (correct or incorrect) and not
translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g., blank segments,
incorrectly-segmented segments, segments with foreign languages
* Tagging unmatched words attached to other words or phrases
This release contains four types of files: raw, tokenized, treebank,
and wa. The raw format contains the original Arabic and English sentences without
any annotation. The tokenized format is the treebank tokenized version of the raw
data. It may contain Empty Category tokens (treebank leaves that have the POS label
-NONE-). The treebank and wa files are treebank and word alignment annotations on
the tokenized files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636789
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 811-846-772-709-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HyTER Networks of Selected OpenMT08/09 Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL
and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source
Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is
an evaluation metric based on large reference networks created by an annotation tool
that allows users to develop an exponential number of correct translations for a given
sentence. Reference networks can be used as a foundation for developing improved machine
translation evaluation metrics and for automating the evaluation of human translation
efficiency. *Data* The source material consists of Arabic and Chinese newswire
and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations
under three annotation protocols. In the first protocol, foreign language native speakers
built English networks starting from foreign language sentences. In the second, English
native speakers built English networks from the best translation of a foreign language
sentence as identified by NIST (National Institute of Standards and Technology). In
the third protocol, English native speakers built English networks starting from the
best translation, but those annotators also had access to three additional, independently
produced human translations. Networks created by different annotators for each sentence
were combined and evaluated. This release includes the source sentences and four human
reference translations produced by LDC in XML format, along with five machine translation
system outputs representing a variety of system architectures and performance, and
the human post-edited output of those systems also presented in XML.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Arabic, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dreyer, Markus
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636770
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 019-953-960-209-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by the
Linguistic Data Consortium (LDC) and contains 162,359 tokens of word aligned Arabic
and English parallel text enriched with linguistic tags. This material was used as
training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the incorporation of linguistic
knowledge in word aligned text as a means to improve automatic word alignment and
machine translation quality. This is accomplished with two annotation schemes: alignment
and tagging. Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of word tags and
alignment link tags are designed in the tagging scheme to describe these translation
units and relations. Tagging adds contextual, syntactic and language-specific features
to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
*Data* This release consists of Arabic source newswire collected by LDC in 2004 -
2006 and 2008. The distribution by genre, words, character tokens and segments appears
below:
Language  Genre  Files  Words    CharTokens  Segments
Arabic    NW     1,126  112,318  162,359     5,349
Note that word count is based on the untokenized Arabic source, and token count
is based on the tokenized Arabic source. The Arabic word alignment tasks consisted
of the following components:
* Identifying and correcting incorrectly tokenized tokens
* Identifying different types of links
* Identifying sentence segments not suitable for annotation, such as those that were
  blank, incorrectly-segmented or containing other languages
* Tagging unmatched words attached to other words or phrases
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Standard Arabic, and Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636339
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 838-711-181-871-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Hispanic-English Database
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Hispanic-English Database contains approximately 30 hours of English and Spanish conversational
and read speech with transcripts (24 hours) and metadata collected from 22 non-native
English speakers between 1996 and 1998. The corpus was developed by Entropic Research
Laboratory, Inc., a developer of speech recognition and speech synthesis software
toolkits that was acquired by Microsoft in 1999. Participants were adult native speakers
of Spanish as spoken in Central America and South America who resided in the Palo
Alto, California area, had lived in the United States for at least one year and demonstrated
a basic ability to understand, read and speak English. They read a total of 2200 sentences,
50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset
of the materials in LATINO-40 Spanish Read News, and the English sentence prompts
were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises
similar to those used in English as a second language instruction and designed to engage
the speakers in collaborative, problem-solving activities. *Data* Read speech was
recorded on two wideband channels with a Shure SM10A head-mounted microphone in a
quiet laboratory environment. The conversational speech was simultaneously recorded
on four channels, two of which were used to place phone calls to each subject in two
separate offices and to record the incoming speech of the two channels into separate
files. The audio was originally saved under the Entropic Audio (ESPS) format using
a 16kHz sampling rate and 16 bit samples. Audio files were converted to flac compressed
.wav files from the ESPS format. ESPS headers were removed and are presented in this
release as *.hdr files that include demographic and technical data. Transcripts were
developed with the Entropic Annotator tool and are time-aligned with speaker turns.
The transcription conventions were based on those used in the LDC Switchboard and
CALLHOME collections. Transcript files are denoted with a .lab extension. Data files
and their corresponding label files are stored in subdirectories named using a speaker-pair
id and session number. The first three letters identify the speaker on channel A.
The last three letters identify the speaker on channel B. Wideband audio files contain
*.wb.flac in their file name, and narrow band audio files are denoted with a *.nb.flac
in the file name.
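The naming conventions described above can be sketched in Python. This is a minimal illustration, not part of the release: the function names are hypothetical, and the sketch assumes a speaker-pair id is exactly six letters (three per speaker), which the summary implies but does not state outright. How the session number is attached to the directory name is not specified, so it is not parsed here.

```python
from typing import Optional, Tuple


def split_speaker_pair(pair_id: str) -> Tuple[str, str]:
    """Split a speaker-pair id into (channel A speaker, channel B speaker).

    Assumption: the id is exactly six letters, three per speaker.
    """
    if len(pair_id) != 6 or not pair_id.isalpha():
        raise ValueError(f"expected a six-letter speaker-pair id, got {pair_id!r}")
    return pair_id[:3], pair_id[3:]


def audio_band(filename: str) -> Optional[str]:
    """Classify an audio file by the band marker in its file name."""
    if filename.endswith(".wb.flac"):
        return "wideband"
    if filename.endswith(".nb.flac"):
        return "narrowband"
    return None  # not an audio file under this convention (e.g. .lab, .hdr)
```

For example, a hypothetical pair id `abcxyz` would map `abc` to channel A and `xyz` to channel B.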
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Byrne, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Knodt, Eva
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bernstein, Jared
ADDED ENTRY--PERSONAL NAME
- Personal name:
Emami, Farzhad
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636797
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 970-706-902-140-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast news (BN) data collected by LDC between
2005 and 2007 and transcribed by LDC or under its direction. *Data* This release includes
30 source-translation document pairs, comprising 206,737 characters of translated
material. Data is drawn from 12 distinct Chinese BN programs broadcast by China Central
TV, a national and international broadcaster in Mainland China; New Tang Dynasty TV,
a broadcaster based in the United States; and Phoenix TV, a Hong Kong-based satellite
television station. The broadcast news recordings in this release focus principally
on current events. The data was transcribed by LDC staff and/or transcription vendors
under contract to LDC in accordance with Quick Rich Transcription guidelines developed
by LDC. Transcribers indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to several criteria, including
linguistic features, transcription features and topic features. The transcribed and
segmented files were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Chinese to English translation
guidelines. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
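Because TDF files are tab-delimited and UTF-8 encoded, they can be read with a generic tab-separated parser. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, no field names or header conventions are assumed (those are documented in TDF_format.text, which is not reproduced here), and each non-empty row is returned as a plain list of strings.

```python
import csv
from typing import Iterable, List


def read_tdf_rows(lines: Iterable[str]) -> List[List[str]]:
    """Return the tab-separated fields of each non-empty line.

    Quote characters are treated as ordinary text (QUOTE_NONE), since
    transcribed speech may contain quotation marks.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]
```

To read a file from disk, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to `read_tdf_rows`.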
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636819
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 637-196-362-554-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Abstract Meaning Representation (AMR) Annotation Release 1.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Abstract Meaning Representation (AMR) Annotation Release 1.0 was developed by the
Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's
Computational Language and Educational Research group and the Information Sciences
Institute at the University of Southern California. It contains a sembank (semantic
treebank) of over 13,000 English natural language sentences from newswire, weblogs
and web discussion forums. AMR captures “who is doing what to whom” in a sentence.
Each sentence is paired with a graph that represents its whole-sentence meaning in
a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence
coreference, named entity annotation, modality, negation, questions, quantities, and
so on to represent the semantic structure of a sentence largely independent of its
syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release
2.0 (LDC2017T10). *Data* The source data includes discussion forums collected for
the DARPA BOLT program, Wall Street Journal and translated Xinhua news texts, various
newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE
program. The following table summarizes the number of training, dev, and test AMRs
for each dataset in the release. Totals are also provided by partition and dataset:
Dataset          Training  Dev  Test  Totals
BOLT DF MT       1061      133  133   1327
Weblog and WSJ   0         100  100   200
BOLT DF English  1703      210  229   2142
2009 Open MT     204       0    0     204
Xinhua MT        741       99   86    926
Totals           3709      542  548   4799
For those interested in utilizing a standard/community
partition for AMR research (for instance in development of semantic parsers), data
in the "split" directory contains 13,051 AMRs divided roughly 80/10/10 into training/dev/test
partitions, with most smaller datasets assigned to one of the splits as a whole. Note
that splits observe document boundaries. The "unsplit" directory contains the same
13,051 AMRs with no train/dev/test partition.
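The per-dataset counts in the table of training, dev and test AMRs can be checked for internal consistency with a few lines of Python. This is only an illustrative sketch: the figures are copied from the table in this record, and the names `ROWS`, `TOTALS` and `check_table` are hypothetical.

```python
# (training, dev, test, total) per dataset, as listed in the table above.
ROWS = {
    "BOLT DF MT": (1061, 133, 133, 1327),
    "Weblog and WSJ": (0, 100, 100, 200),
    "BOLT DF English": (1703, 210, 229, 2142),
    "2009 Open MT": (204, 0, 0, 204),
    "Xinhua MT": (741, 99, 86, 926),
}
TOTALS = (3709, 542, 548, 4799)


def check_table(rows, totals):
    """Return the names of rows whose figures do not add up."""
    problems = []
    for name, (train, dev, test, total) in rows.items():
        if train + dev + test != total:
            problems.append(name)
    # Each column of the Totals row should equal the sum of that column.
    column_sums = tuple(sum(r[i] for r in rows.values()) for i in range(4))
    if column_sums != tuple(totals):
        problems.append("Totals")
    return problems
```

Calling `check_table(ROWS, TOTALS)` returns an empty list when every row and the Totals row are consistent.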
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Knight, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Baranescu, Laura
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bonial, Claire
ADDED ENTRY--PERSONAL NAME
- Personal name:
Georgescu, Madalina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Griffitt, Kira
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hermjakob, Ulf
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schneider, Nathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636800
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 931-416-601-272-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MADCAT Chinese Pilot Training Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese
Pilot Training Set contains all training data created by the Linguistic Data Consortium
(LDC) to support a Chinese pilot collection in the DARPA MADCAT Program. The data
in this release consists of handwritten Chinese documents, scanned at high resolution
and annotated for the physical coordinates of each line and token. Digital transcripts
and English translations of each document are also provided, with the various content
and annotation layers integrated in a single MADCAT XML output. The goal of the MADCAT
program was to automatically convert foreign text images into English transcripts.
MADCAT Chinese pilot data was collected from Chinese source documents in three genres:
newswire, weblog and newsgroup text. Chinese speaking "scribes" copied documents by
hand, following specific instructions on writing style (fast, normal, careful), writing
implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents
were processed to optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple "pages" for handwriting.
Each resulting handwritten page was assigned to up to five independent scribes, using
different writing conditions. The handwritten, transcribed documents were next checked
for quality and completeness, then each page was scanned at a high resolution (600
dpi, greyscale) to create a digital version of the handwritten document. The scanned
images were then annotated to indicate the physical coordinates of each line and token.
Explicit reading order was also labeled, along with any errors produced by the scribes
when copying the text. The final step was to produce a unified data format that takes
multiple data streams and generates a single MADCAT XML output file which contains
all required information. The resulting madcat.xml file contains distinct components:
a text layer that consists of the source text, tokenization and sentence segmentation;
an image layer that consists of bounding boxes; a scribe demographic layer that consists
of scribe ID and partition (train/test); and a document metadata layer. LDC has also
released:
* MADCAT Phase 1 Training Set (LDC2012T15)
* MADCAT Phase 2 Training Set (LDC2013T09)
* MADCAT Phase 3 Training Set (LDC2013T15)
*Data* This release includes
22,284 annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml)
along with their corresponding scanned image files in TIFF format. The annotation
results in GEDI XML files include ground truth annotations and source transcripts.
Files are named as follows: * galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}
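The file naming pattern above can be split into its parts with a regular expression. The following Python sketch is illustrative only: the names `MadcatName` and `parse_madcat_name` are hypothetical, and it assumes that the page and scribe ID components contain no underscores (the galeID component may), which the record does not confirm.

```python
import re
from typing import NamedTuple, Optional


class MadcatName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    kind: str  # "tif", "gedi.xml" or "madcat.xml"


# Assumption: page and scribe ID are the last two underscore-separated
# components before the extension; everything before them is the galeID.
_PATTERN = re.compile(
    r"^(?P<gale_id>.+)_(?P<page>[^_]+)_(?P<scribe_id>[^_]+)"
    r"\.(?P<kind>tif|gedi\.xml|madcat\.xml)$"
)


def parse_madcat_name(filename: str) -> Optional[MadcatName]:
    """Parse galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}, or return None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return MadcatName(**match.groupdict())
```

Grouping parsed names by `(gale_id, page)` would then collect the scanned image and both annotation files for every scribe who copied a given page.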
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Doermann, Dave
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u amh d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636827
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 180-783-854-340-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
amh
- Language code of text/sound track or separate title:
hat
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
bos
- Language code of text/sound track or separate title:
hrv
- Language code of text/sound track or separate title:
geo
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
tur
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
hau
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
ukr
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
amh
- Language code of text/sound track or separate title:
hat
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
urd
- Language code of text/sound track or separate title:
bos
- Language code of text/sound track or separate title:
hrv
- Language code of text/sound track or separate title:
kat
- Language code of text/sound track or separate title:
kor
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
tur
- Language code of text/sound track or separate title:
vie
- Language code of text/sound track or separate title:
yue
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
hau
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
ukr
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2009 NIST Language Recognition Evaluation Test Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2009 NIST Language Recognition Evaluation Test Set contains approximately 215 hours
of conversational telephone speech and radio broadcast conversation collected by the
Linguistic Data Consortium (LDC) in the following 23 languages and dialects: Amharic,
Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English
(Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese,
Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese. The goal of the NIST (National
Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to
establish the baseline of current performance capability for language recognition
of conversational telephone speech and to lay the groundwork for further research
efforts in the field. NIST conducted language recognition evaluations in 1996, 2003,
2005 and 2007. The 2009 evaluation increased the number of target languages. Most
of the test data originated from multilingual Voice of America (VOA) radio broadcasts
assessed as being of telephone bandwidth in addition to conversational telephone speech.
Further information regarding this evaluation can be found in the evaluation plan
which is included in the documentation for this release. LDC released the prior LREs
as: * 2003 NIST Language Recognition Evaluation (LDC2006S31) * 2005 NIST Language
Recognition Evaluation (LDC2008S05) * 2007 NIST Language Recognition Evaluation Test
Set (LDC2009S04) * 2007 NIST Language Recognition Evaluation Supplemental Training
Set (LDC2009S05) *Data* The VOA speech data was collected by LDC in 2000 and 2001
and constitutes approximately 75% of the test set. The telephone speech was taken
from LDC's Mixer 3 collection recorded between 2005 and 2007. All test speech segments
are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each
segment is stored separately in a single channel SPHERE format file. The test segments
contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual
speech durations vary, but were constrained to be within the ranges of 2-4 seconds,
7-13 seconds and 23-35 seconds, respectively. Non-speech portions were retained in
each segment so that the segment contained a continuous sample of the
source recording. Therefore, the test segments may be significantly longer than the
speech duration, depending on how much non-speech was included.
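The relationship between nominal and actual speech durations described above can be sketched as a lookup. This is an illustrative sketch only: the function and constant names are hypothetical, and treating the range endpoints as inclusive is an assumption.

```python
from typing import Optional

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# as quoted in the summary. Inclusive endpoints are an assumption.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def nominal_duration(speech_seconds: float) -> Optional[int]:
    """Return the nominal class for a measured speech duration, or None."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


print(nominal_duration(3.5))   # falls in the 2-4 second range
print(nominal_duration(5.0))   # falls between ranges
```

Note that this applies to the amount of speech in a segment, not to the file length, since segments also contain non-speech portions.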
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Amharic, Haitian, English, French, Hindi, Spanish, Urdu, Bosnian, Croatian,
Georgian, Korean, Portuguese, Turkish, Vietnamese, Yue Chinese, Dari, Persian, Hausa,
Mandarin Chinese, Russian, Ukrainian, and Pushto. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Greenberg, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandschain, Linda
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636835
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 281-668-197-339-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Word Alignment Training Part 3 -- Web
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Word Alignment Training Part 3 -- Web was developed by the Linguistic
Data Consortium (LDC) and contains 217,158 tokens of word aligned Arabic and English
parallel text enriched with linguistic tags. This material was used as training data
in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches
to statistical machine translation include the incorporation of linguistic knowledge
in word aligned text as a means to improve automatic word alignment and machine translation
quality. This is accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. A set of word tags and alignment
link tags are designed in the tagging scheme to describe these translation units and
relations. Tagging adds contextual, syntactic and language-specific features to the
alignment annotation. Other releases available in this series are: * GALE Chinese-English
Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16) * GALE
Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
* GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10) *Data*
This release consists of Arabic source web data collected by LDC. The distribution
by genre, words, character tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  Segments
Arabic    WB     2,449  154,144  217,158     7,332
Note that word count is based on the untokenized Arabic source, and token count is
based on the tokenized Arabic source. The Arabic word alignment tasks consisted of
the following components:
* Normalizing tokenized tokens as needed * Identifying different types of links *
Identifying sentence segments not suitable for annotation * Tagging unmatched words
attached to other words or phrases
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic, Standard Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636843
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 629-249-332-330-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Newswire Parallel Text Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Newswire Parallel Text Part 1 was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains 117,173 tokens of Chinese source text
and corresponding English translations selected from newswire data collected by LDC
in 2007 and translated by LDC or under its direction. *Data* This release includes
167 source-translation document pairs, comprising 117,173 tokens of translated data.
Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming
Daily, People's Daily and People's Liberation Army Daily. Data was manually selected
for translation according to several criteria, including linguistic features and topic
features. The files were formatted into a human-readable translation format and assigned
to translation vendors. Translators followed LDC's Chinese to English translation
guidelines. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
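Because TDF files are tab-delimited and UTF-8 encoded, they can be read with a standard tab-separated reader. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it makes no claim about field names or header lines, which are documented in the release itself.

```python
import csv
from typing import Iterable, Iterator, List


def read_tdf_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-empty tab-delimited line as a list of field strings."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical in-memory sample; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample = ["a\tb\tc\n", "\n", "d\te\tf\n"]
for fields in read_tdf_rows(sample):
    print(fields)
```

Returning plain lists rather than named records avoids guessing at the column layout; the field descriptions shipped with the corpus would determine how to interpret each position.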
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636851
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 043-495-621-872-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TAC KBP Reference Knowledge Base
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TAC KBP Reference Knowledge Base was developed by the Linguistic Data Consortium (LDC)
in support of the NIST-sponsored TAC-KBP evaluation series. It is a knowledge base
built from English Wikipedia articles and their associated infoboxes and covers over
800,000 entities. LDC also released TAC KBP Spanish Cross-lingual Entity Linking -
Comprehensive Training and Evaluation Data 2012-2014 (LDC2016T26). TAC (Text Analysis
Conference) is a series of workshops organized by NIST (the National Institute of
Standards and Technology) to encourage research in natural language processing and
related applications by providing a large test collection, common evaluation procedures,
and a forum for researchers to share their results. TAC's KBP track (Knowledge Base
Population) encourages the development of systems that can match entities mentioned
in natural texts with those appearing in a knowledge base and extract novel information
about entities from a document collection and add it to a new or existing knowledge
base. Consult the LDC TAC-KBP project page for further information about LDC's resource
development for the TAC-KBP program. *Data* The source data (Wikipedia infoboxes and
articles) was taken from an October 2008 snapshot of Wikipedia. TAC KBP Reference
Knowledge Base contains a set of entities, each with a canonical name and title for
the Wikipedia page, an entity type, an automatically parsed version of the data from
the infobox in the entity's Wikipedia article, and a stripped version of the text
of the Wiki article. Each entity is assigned one of four types: PER (person), ORG
(organization), GPE (geo-political entity) and UKN (unknown). All data files are presented
as UTF-8 encoded XML.
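The four entity types named above can be modeled as a small enumeration. This is a minimal sketch with hypothetical names; it does not reflect the XML schema of the release, only the type codes and glosses given in the summary.

```python
from enum import Enum


class EntityType(Enum):
    """Entity type codes and glosses as listed in the summary."""
    PER = "person"
    ORG = "organization"
    GPE = "geo-political entity"
    UKN = "unknown"


def parse_entity_type(code: str) -> EntityType:
    """Map a type code such as 'GPE' to its EntityType (hypothetical helper)."""
    try:
        # Tolerating whitespace and lowercase input is an assumption.
        return EntityType[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unrecognized entity type code: {code!r}") from None


print(parse_entity_type("GPE").value)
```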
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Simpson, Heather
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parker, Robert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 476-771-845-495-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast News Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast News Transcripts Part 1 was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 165 hours of Arabic
broadcast news speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia)
and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic
Broadcast News Speech Part 1 (LDC2014S07). The broadcast recordings used for transcription
feature news programs focusing principally on current events from the following sources:
Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam
News Channel, based in Iran; Alhurra, a U.S. government-funded regional broadcaster;
Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai TV, a broadcast station
in the United Arab Emirates; Al Iraqiyah, an Iraqi television station; Kuwait TV,
a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese
television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national
television station based in Saudi Arabia; and Syria TV, the national television station
in Syria. *Data* The transcript files are in plain-text, tab-delimited format (TDF)
with UTF-8 encoding, and the transcribed data totals 897,868 tokens. The transcripts
were created with the LDC-developed transcription tool, XTrans, a multi-platform,
multilingual, multi-channel transcription tool that supports manual transcription
and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal additional mark-up.
It does not include sentence unit annotation. QRTR annotation adds structural information
such as topic boundaries and manual sentence unit annotation to the core components
of a quick transcript. Files with QTR as part of the filename were developed using
QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
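The filename convention above can be checked with a simple substring test. The sketch below is illustrative: the function name and sample filenames are hypothetical, and case-insensitive matching is an assumption about how the markers appear.

```python
def transcription_style(filename: str) -> str:
    """Return 'QRTR', 'QTR', or 'unknown' based on the marker in the filename."""
    upper = filename.upper()  # assumption: the marker may appear in any case
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return "unknown"


# Hypothetical filenames, for illustration only.
for name in ("PROGRAM_20070101.qrtr.tdf", "PROGRAM_20070102.qtr.tdf", "README.txt"):
    print(name, "->", transcription_style(name))
```

A QRTR result indicates that the transcript should carry topic boundaries and manual sentence unit annotation, while a QTR transcript would not.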
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636878
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 705-070-804-202-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast News Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast News Speech Part 1 was developed by the Linguistic Data
Consortium (LDC) and comprises approximately 165 hours of Arabic broadcast news
speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat,
Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News
Transcripts Part 1 (LDC2014T17). Broadcast audio for the GALE program was collected
at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong
Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. LDC’s local broadcast collection system
is highly automated, easily extensible and robust, and is capable of collecting, processing
and evaluating hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast
satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular. All signal routing is performed under
computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high
bandwidth A/V format and are then processed to extract audio, to generate keyframes
and compressed audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition (ASR) output.
An overview of the system, the sources recorded and the configuration of the recording
laboratory are contained in the Guidelines for Broadcast Audio Collection Version
3.0 included in this release. LDC designed a portable platform for remote broadcast
collection. This is a TiVo-style digital video recording (DVR) system that records
two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL)
and FTA DVB-S satellite programming and can operate outside of the United States.
It has a small footprint, weighs less than 30 pounds and can be transported as carry-on
luggage. Medianet collected Arabic programming from across the Gulf region using its
internal system and LDC's portable broadcast collection platform installed in 2008.
The portable platform deployed at the Medianet Tunisian collection facility collected
multiple streams of regional Arabic programming from various sources. MTC collected
Arabic programming using its internal collection system. *Data* The broadcast recordings
in this release feature news programs focusing principally on current events from
the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United
Arab Emirates; Al Alam News Channel, based in Iran; Alhurra, a U.S. government-funded
regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai
TV, a broadcast station in the United Arab Emirates; Al Iraqiyah, an Iraqi television
station; Kuwait TV, a national broadcast station in Kuwait; Lebanese Broadcasting
Corporation, a Lebanese television station; Nile TV, a broadcast programmer based
in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria
TV, the national television station in Syria. This release contains 200 audio files
presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure
Specification Version 2.0 which is included in this release. The broadcast auditing
process served three principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or faulty recordings;
as an indicator of broadcast schedule changes by identifying instances when the incorrect
program was recorded; and as a guide for data selection by retaining information about
a program’s genre, data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636886
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 600-375-253-846-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2007 Multilingual Training Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium
(LDC) and contains the complete set of Arabic and Spanish training data for the 2007
Automatic Content Extraction (ACE) technology evaluation, specifically, Arabic and
Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions.
The objective of the ACE program was to develop automatic content extraction technology
to support automatic processing of human language in text form from a variety of sources
including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants
were tested on system performance for the recognition of entities, values, temporal
expressions, relations, and events in Chinese and English and for the recognition
of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE
program is described in more detail on the LDC ACE project pages. The LDC Catalog
contains a series of publications from the ACE project and from researchers building
on that work. Among them are:
* ACE-2 Version 1.0 (LDC2003T11)
* TIDES Extraction (ACE) 2003 Multilingual Training Data (LDC2004T09)
* ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07)
* ACE 2004 Multilingual Training Corpus (LDC2005T09)
* ACE 2005 Multilingual Training Corpus (LDC2006T06)
* ACE 2005 English SpatialML Annotations (LDC2008T03)
* ACE 2005 Mandarin SpatialML Annotations (LDC2010T09)
* ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (LDC2010T18)
* ACE 2005 English SpatialML Annotations Version 2 (LDC2011T02)
* Datasets for Generic Relation Extraction (reACE) (LDC2011T08)
*Data* The Arabic data is composed of newswire (60%)
published in October 2000-December 2000 and weblogs (40%) published during the period
November 2004-February 2005. The Spanish data set consists entirely of newswire material
from multiple sources published in January 2005-April 2005. Data selection was semi-automatic.
A document pool was established for each language based on genre and epoch requirements.
Humans reviewed the pool to select individual documents suitable for ACE annotation,
such as documents that were representative of their genre and contained targeted ACE
entity types. One annotator completed the entity and temporal expression (TIMEX2)
markup in the first pass annotation. This work was reviewed in the second pass by
a senior annotator. TIMEX2 values were normalized by an annotator specifically trained
for that task. The table below describes the amount of data included in the current
release and its annotation status. Corpus content for each language and data type
is represented in the three stages of annotation: first pass annotation (1P), second
pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).
Arabic
        Words                        Files
        1P       2P       NORM       1P    2P    NORM
NW      58,015   58,015   58,015     257   257   257
WL      40,338   40,338   40,338     121   121   121
Total   98,353   98,353   98,353     378   378   378
Spanish
        Words                        Files
        1P       2P       NORM       1P    2P    NORM
NW      100,401  100,401  100,401    352   352   352
Total   100,401  100,401  100,401    352   352   352
For a given document, there is a source .sgm file together with
the .ag.xml and .apf.xml annotation files in each of the three directories "1p", "2p"
and "timex2norm". In other words, for each newswire story or weblog entry, the three
annotation directories each contain an identical copy of the source text (SGML .sgm
file) along with distinct versions of the associated annotations (XML .ag.xml, .apf.xml
files and plain text .tab files). Note that in many cases, two annotation stages have
produced identical output for a given source text, if no changes were made in the
latter stage. All files are presented in UTF-8.
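The directory layout described above can be sketched in a few lines of Python. This is only an illustration, not part of the corpus tooling: the function name, the root path and the document ID are hypothetical. The stage directory names and file extensions are taken from the summary, and the real release may nest these directories under language or genre folders.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Annotation stage directories and per-document file extensions named in the summary.
STAGES = ("1p", "2p", "timex2norm")
EXTENSIONS = (".sgm", ".ag.xml", ".apf.xml")


def expected_files(root: str, doc_id: str) -> list[str]:
    """List the paths expected for one document across the three annotation stages.

    `root` and `doc_id` are hypothetical placeholders chosen for illustration.
    """
    base = PurePosixPath(root)
    return [
        str(base / stage / f"{doc_id}{ext}")
        for stage in STAGES
        for ext in EXTENSIONS
    ]
```

For a made-up document ID, the helper returns nine paths: one source file and two annotation files in each of the three stage directories.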
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636894
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 648-015-860-144-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Newswire Parallel Text Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Newswire Parallel Text Part 2 was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains 117,895 tokens of Chinese source text
and corresponding English translations selected from newswire data collected by LDC
in 2007 and translated by LDC or under its direction. LDC has also released GALE Phase
2 Chinese Newswire Parallel Text Part 1 (LDC2014T15). *Data* This release includes
177 source-translation document pairs, comprising 117,895 tokens of translated data.
Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming
Daily, People's Daily and People's Liberation Army Daily. Data was manually selected
for translation according to several criteria, including linguistic features and topic
features. The files were formatted into a human-readable translation format and assigned
to translation vendors. Translators followed LDC's Chinese to English translation
guidelines. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
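Because TDF files are tab-delimited, they can be read with a standard tab-separated parser. The following Python is a minimal sketch under stated assumptions: the function name is hypothetical, the two-field sample is invented for illustration, and the actual field order and meaning are those documented in TDF_format.text, which is not reproduced here.

```python
import csv
import io


def read_tdf_rows(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of field strings."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Invented two-field sample; real TDF files carry more fields per segment.
sample = io.StringIO("segment-1\t你好\nsegment-2\t谢谢\n")
rows = list(read_tdf_rows(sample))
```

When reading from disk, opening the file with encoding="utf-8" and newline="" matches the UTF-8 encoding stated above and lets the csv module handle line endings itself.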
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636908
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 907-876-388-540-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Word Alignment -- Broadcast Training Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Word Alignment -- Broadcast Training Part 1 was developed by the
Linguistic Data Consortium (LDC) and contains 267,257 tokens of word aligned Arabic
and English parallel text enriched with linguistic tags. This material was used as
training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the incorporation of linguistic
knowledge in word aligned text as a means to improve automatic word alignment and
machine translation quality. This is accomplished with two annotation schemes: alignment
and tagging. Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of word tags and
alignment link tags are designed in the tagging scheme to describe these translation
units and relations. Tagging adds contextual, syntactic and language-specific features
to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
* GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10)
* GALE Arabic-English Word Alignment Training Part 3 -- Web (LDC2014T14)
*Data* This release
consists of Arabic source broadcast news and broadcast conversation data collected
by LDC from 2007-2009. The distribution by genre, words, tokens and segments appears
below:
Language  Genre   Files  Words    Tokens   Segments
Arabic    BC      231    79,485   103,816  4,114
Arabic    BN      92     131,789  163,441  7,227
Totals            323    211,274  267,257  11,341
Note that word count
is based on the untokenized Arabic source, and token count is based on the tokenized
Arabic source. The Arabic word alignment tasks consisted of the following components:
* Normalizing tokenized tokens as needed
* Identifying different types of links
* Identifying sentence segments not suitable for annotation
* Tagging unmatched words attached to other words or phrases
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic, Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636924
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 492-150-006-320-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Discourse Treebank 0.5
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Discourse Treebank 0.5 was developed at Brandeis University as part of the
Chinese Treebank Project and consists of approximately 73,000 words of Chinese newswire
text annotated for discourse relations. It follows the lexically grounded approach
of the Penn Discourse Treebank (PDTB) (LDC2008T05) with adaptations based on the linguistic
and statistical characteristics of Chinese text. Discourse relations are lexically
anchored by discourse connectives (e.g., because, but, therefore), which are viewed
as predicates that take abstract objects such as propositions, events and states as
their arguments. Along with PDTB-style schemes for English, Turkish, Hindi and Czech,
Chinese Discourse Treebank provides an additional perspective on how the PDTB approach
can be extended for cross-lingual annotation of discourse relations. *Data* Data was
selected from the newswire material in Chinese Treebank 8.0 (LDC2013T21), specifically,
from Xinhua News Agency stories. There are approximately 5,500 annotation instances.
Following the PDTB format, each annotation instance consists of 27 vertical bar delimited
fields. The fields specify the attributes of the discourse relation as a whole, as
well as the attributes of its two arguments. Not all fields are filled in this release.
Filled fields are indicated by a pair of angle brackets; the remaining fields are
placeholders for future releases.
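The 27-field, vertical-bar-delimited layout described above lends itself to a simple structural check. The Python below is a sketch only: the function names and the sample line are hypothetical, the field contents are invented, and it assumes that no field contains an unescaped vertical bar. It does not interpret the individual fields, whose meaning is defined in the corpus documentation.

```python
from __future__ import annotations

EXPECTED_FIELDS = 27  # field count stated in the summary


def split_instance(line: str) -> list[str]:
    """Split one annotation instance on vertical bars and verify the field count."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


def nonempty_positions(fields: list[str]) -> list[int]:
    """Return the 1-based positions of fields that contain any text."""
    return [i for i, field in enumerate(fields, start=1) if field.strip()]


# Invented instance: only the first and fourth fields carry a value.
sample_fields = [""] * EXPECTED_FIELDS
sample_fields[0] = "Explicit"
sample_fields[3] = "因为"
sample_line = "|".join(sample_fields)
```

Listing the non-empty positions is one way to see which fields a given release actually fills and which remain placeholders.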
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhou, Yuping
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhang, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636932
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 527-011-778-815-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
fre
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
fra
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
United Nations Proceedings Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
United Nations Proceedings Speech was developed by the United Nations (UN) and contains
approximately 8,500 hours of recorded proceedings in the six official UN languages,
Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012
from sessions 64-66 of the General Assembly (GA) and First Committee (FC) (Disarmament
and International Security), and meetings 6434-6763 of the Security Council. Recordings
were made using a customized system following a daily, internally circulated instruction
from the Meetings Management Section. Most of the subjects and information related
to a particular meeting or session are published in a UN Journal, which can be found
at the following link: http://www.un.org/en/documents/journal.asp *Data* Data is presented
as either MP3 or FLAC-compressed WAV files. All are 16-bit, single-channel files sampled
at either 22,050 Hz or 8,000 Hz and are organized by committee and session number, then language. The folder
labeled "Floor" indicates the microphone used by the particular speaker. Those files
may include other languages, for instance, if the speaker's language was not among
the six official UN languages. File names for GA and FC data take the form
LYY_ZZ_format.format, and file names for Security Council data take the form LYYYY_ZZ_format.format,
where L is a one-letter language designation, YY (or YYYY) is the meeting number, ZZ is
the audio segment number and format.format is the wav or mp3 designation. Note that
not all files are present for every language.
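The file naming convention described above can be illustrated with a regular expression. This Python is a hedged sketch, not a tool shipped with the corpus: the names `NAME_RE`, `UNFileName` and `parse_name` are hypothetical, the example file names are invented, and it assumes that the meeting number is exactly two or four digits, that the segment number is numeric, and that the trailing "format.format" part can be kept as an uninterpreted string.

```python
from __future__ import annotations

import re
from typing import NamedTuple

# One language letter, a 2- or 4-digit meeting number, a segment number,
# then the trailing format designation kept verbatim.
NAME_RE = re.compile(
    r"^(?P<lang>[A-Za-z])(?P<meeting>\d{2}|\d{4})_(?P<segment>\d+)_(?P<suffix>.+)$"
)


class UNFileName(NamedTuple):
    lang: str
    meeting: str
    segment: str
    suffix: str


def parse_name(name: str) -> UNFileName | None:
    """Split a file name into its parts, or return None if it does not fit the pattern."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return UNFileName(**match.groupdict())
```

Returning None for non-matching names, rather than raising, makes it straightforward to skip documentation or index files while walking the directory tree.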
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, Standard Arabic, French, Russian, and Spanish.
Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chay, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Elizalde, Cecilia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ziemski, Michal
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636916
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 901-500-716-588-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Arabic-English Word Alignment -- Broadcast Training Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Arabic-English Word Alignment -- Broadcast Training Part 2 was developed by the
Linguistic Data Consortium (LDC) and contains 215,923 tokens of word aligned Arabic
and English parallel text enriched with linguistic tags. This material was used as
training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the incorporation of linguistic
knowledge in word aligned text as a means to improve automatic word alignment and
machine translation quality. This is accomplished with two annotation schemes: alignment
and tagging. Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of word tags and
alignment link tags are designed in the tagging scheme to describe these translation
units and relations. Tagging adds contextual, syntactic and language-specific features
to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
* GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10)
* GALE Arabic-English Word Alignment Training Part 3 -- Web (LDC2014T14)
* GALE Arabic-English Word Alignment -- Broadcast Training Part 1 (LDC2014T19)
*Data* This release consists
of Arabic source broadcast news and broadcast conversation data collected by LDC from
2007-2009. The distribution by genre, words, tokens and segments appears below:
Language  Genre   Files  Words    Tokens   Segments
Arabic    BC      369    97,514   129,233  7,941
Arabic    BN      40     70,635   86,400   3,752
Totals            409    168,149  215,923  11,693
Note that word count is based on the untokenized Arabic source, and
token count is based on the tokenized Arabic source. The Arabic word alignment tasks
consisted of the following components:
* Normalizing tokenized tokens as needed
* Identifying different types of links
* Identifying sentence segments not suitable for annotation
* Tagging unmatched words attached to other words or phrases
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic, Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636940
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 221-795-248-256-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Fisher and CALLHOME Spanish--English Speech Translation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Fisher and CALLHOME Spanish-English Speech Translation was developed at Johns Hopkins
University and contains English reference translations and speech recognizer output
(in various forms) that complement the LDC Fisher Spanish (LDC2010T04) and CALLHOME
Spanish audio and transcript releases (LDC96T17). Together, they make a four-way parallel
text dataset representing approximately 38 hours of speech, with defined training,
development, and held-out test sets. *Data* The source data are the Fisher Spanish
and CALLHOME Spanish corpora developed by LDC, comprising transcribed telephone conversations
between (mostly native) Spanish speakers in a variety of dialects. The Fisher Spanish
data set consists of 819 transcribed conversations on an assortment of provided topics
primarily between strangers, resulting in approximately 160 hours of speech aligned
at the utterance level, with 1.5 million tokens. The CALLHOME Spanish corpus comprises
120 transcripts of spontaneous conversations primarily between friends and family
members, resulting in approximately 20 hours of speech aligned at the utterance level,
with just over 200,000 words (tokens) of transcribed text. Translations were obtained
by crowdsourcing using Amazon's Mechanical Turk, after which the data was split into
training, development, and test sets. The CALLHOME data set defines its own data splits,
organized into train, devtest, and evltest, which were retained here. For the Fisher
material, four data splits were produced: a large training section and three test
sets. These test sets correspond to portions of the data where four translations exist.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Post, Matt
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kumar, Gaurav
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lopez, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Karakos, Damianos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Callison-Burch, Chris
ADDED ENTRY--PERSONAL NAME
- Personal name:
Khudanpur, Sanjeev
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636959
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 974-370-635-113-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Boulder Lies and Truth
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Boulder Lies and Truth was developed at the University of Colorado Boulder and contains
approximately 1,500 elicited English reviews of hotels and electronics for the purpose
of studying deception in written language. Reviews were collected by crowd-sourcing
with Amazon Mechanical Turk. Each review was required to be original and was checked
for plagiarism against the web. Reviews were annotated with respect to the following
three dimensions:
* Domain: Electronics (e.g., iPhone) or Hotels
* Sentiment: Positive or Negative
* Truth Value:
  a) Truthful: a review about an object known by the writer reflecting the real sentiment
     of the writer toward the object of the review
  b) Opposition: a review about an object known by the writer reflecting the opposite
     sentiment of the writer toward the object of the review (i.e., if the writer liked
     the object they were asked to write a negative review; if the writer did not like
     the object, they were asked to write a positive review)
  c) Deceptive (i.e., fabricated): a review written about an object not known by the
     writer either positive or negative in sentiment; the objects reviewed were provided
     via a URL from the tasks in (a) and (b)
*Data*
Each review was judged a total of 30 times: (1) 10 times to evaluate its perceived
quality (on a range from 1-5); (2) 10 times with judgments about its perceived truthfulness
(e.g., truthful or somehow deceptive, a lie or a fabrication); and (3) 10 times for
its perceived sentiment (i.e., star rating). The following metadata is available for
each review:
* time consumed by the writer to write the review
* a pair review ID coupling the two reviews (positive/negative) written about the same
  object by the same person, either false or truthful
* the ID of the writer who wrote the review
* the writer's disclosure as to whether the object to be reviewed was already used
  and/or known to the writer
* the URL identifying an instance of the object (i.e., hotel or electronic product)
  on the web
* a flag for plagiarized reviews
* a marker for reviews that may be removed from the corpus
* the reasons for rejecting a review
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Salvetti, Franco
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636967
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 445-017-231-229-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 was developed
by the Linguistic Data Consortium (LDC) and contains 65,069 tokens of word aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
  (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
*Data* This release consists of Chinese source broadcast conversation (BC) programming
collected by LDC in 2008. The distribution by genre, words, character tokens and segments
appears below:
Language  Genre  Docs  Words   CharTokens  Segments
Chinese   BC     9     43,379  65,069      2,419
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters. The Chinese word alignment
tasks consisted of the following components:
* Identifying, aligning, and tagging eight different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的(DE) except when they were a part
  of a semantic link
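The stated ratio of characters to words can be checked against the counts given in this record. The short sketch below is illustrative only; the variable names are not part of the release.

```python
# Counts copied from the genre distribution given in this record.
words = 43_379
char_tokens = 65_069

# Average number of Chinese characters (tokens) per word.
chars_per_word = char_tokens / words
print(round(chars_per_word, 2))
```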
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636975
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 553-178-294-380-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Chinese Web Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Chinese Web Parallel Text was developed by the Linguistic Data Consortium
(LDC). Along with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
This corpus contains Chinese source text and corresponding English translations selected
from weblog and newsgroup data collected by LDC and translated by LDC or under its
direction. *Data* This release includes 46 source-translation document pairs, comprising
66,779 tokens of translated data. Data is drawn from four Chinese weblog and newsgroup
sources. Data was manually selected for translation according to several criteria,
including linguistic features and topic features. The files were formatted into a
human-readable translation format and assigned to translation vendors. Translators
followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed
quality control procedures on the completed translations. Source data and translations
are distributed in TDF format. TDF files are tab-delimited files containing one segment
of text along with meta information about that segment. Each field in the TDF file
is described in TDF_format.text. All data are encoded in UTF-8.
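Because TDF files are tab-delimited and UTF-8 encoded, they can be read with a generic tab-separated reader. The sketch below is a minimal illustration, not an LDC tool: the function name and the three-column sample are hypothetical, and the real field order is the one defined in TDF_format.text.

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of fields."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical sample with made-up columns (document ID, segment index, text).
sample = io.StringIO("doc1\t0\t这是一个例子\ndoc1\t1\tThis is an example\n")
rows = list(read_tab_delimited(sample))
print(len(rows), len(rows[0]))
```

For a file on disk, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`.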
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Friedman, Lauren
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jin, Hubert
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636983
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 911-510-844-212-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Benchmarks for Open Relation Extraction
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Benchmarks for Open Relation Extraction was developed by the University of Alberta
and contains annotations for approximately 14,000 sentences from The New York Times
Annotated Corpus (LDC2008T19) and Treebank-3 (LDC99T42). This corpus was designed
to contain benchmarks for the task of open relation extraction (ORE), along with sample
extractions from ORE methods and evaluation scripts for computing a method's precision
and recall. ORE attempts to extract as many of the relations described in a corpus as
possible without relying on relation-specific training data. The traditional approach
to relation extraction requires substantial training effort for each relation of interest.
That can be impractical for massive collections such as those found on the web. Open
relation extraction offers
an alternative by extracting unseen relations as they come. It does not require training
data for any particular relation, making it suitable for applications that require
a large (or even unknown) number of relations. Results published in ORE literature
are often not comparable due to the lack of reusable annotations and differences in
evaluation methodology. The goal of this benchmark data set is to provide annotations
that are flexible and can be used to evaluate a wide range of methods. *Data* Binary
and n-ary relations were extracted from the text sources. Sentences were annotated
for binary relations manually and automatically. In the manual sentence annotation,
two entities and a trigger (a single token indicating a relation) were identified
for the relation between them, if one existed. A window of tokens allowed to be in
a relation was specified; that included modifiers of the trigger and prepositions
connecting triggers to their arguments. For each sentence annotated with two entities,
a system must extract a string representing the relation between them. The evaluation
method deemed an extraction correct if it contained the trigger and allowed tokens
only. The automatic annotator identified pairs of entities and a trigger of the relation
between them; the evaluation script for that experiment deemed an extraction correct
if it contained the annotated trigger. For n-ary relations, sentences were annotated
with one relation trigger and all of its arguments. An extracted argument was deemed
correct if it was annotated in the sentence. This release also includes extractions
from the following ORE methods: ReVerb, SONEX, OLLIE, PATTY, TreeKernel, SwiRL, Lund
and EXEMPLAR. Evaluation scripts are also provided for computing a method's precision
and recall.
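The correctness rule for manually annotated binary relations, and the precision and recall computed from it, can be sketched as follows. This is a minimal illustration and not the evaluation script shipped with the release: the function names, the whitespace tokenization, and the example strings are all assumptions.

```python
def is_correct(extraction, trigger, allowed):
    """An extraction is correct if it contains the trigger and only allowed tokens."""
    tokens = extraction.split()  # assumption: whitespace tokenization
    permitted = set(allowed) | {trigger}
    return trigger in tokens and all(t in permitted for t in tokens)


def precision_recall(num_correct, num_extracted, num_annotated):
    """Precision = correct / extracted; recall = correct / annotated relations."""
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_annotated if num_annotated else 0.0
    return precision, recall


# Hypothetical example: trigger "acquired" with one allowed modifier, "recently".
judgements = [
    is_correct("recently acquired", "acquired", ["recently"]),   # trigger + allowed token
    is_correct("acquired shares of", "acquired", ["recently"]),  # contains disallowed tokens
    is_correct("recently", "acquired", ["recently"]),            # trigger missing
]
p, r = precision_recall(sum(judgements), len(judgements), num_annotated=4)
```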
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mesquita, Filipe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schmidek, Jordan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Barbosa, Denilson
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585636991
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 248-931-036-356-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast Conversation Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 was developed by the Linguistic
Data Consortium (LDC) and comprises approximately 126 hours of Mandarin Chinese
broadcast conversation speech collected in 2007 by LDC and Hong Kong University of Science
and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding transcripts are released as GALE Phase
3 Chinese Broadcast Conversation Transcripts Part 1 (LDC2014T28). Broadcast audio
for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at
three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic),
and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection
supported GALE at a rate of approximately 300 hours per week of programming from more
than 50 broadcast sources for a total of over 30,000 hours of collected broadcast
audio over the life of the program. LDC’s local broadcast collection system is highly
automated, easily extensible, robust, and capable of collecting, processing and
evaluating hundreds of hours of content from several dozen sources per day. The broadcast
material is served to the system by a set of free-to-air (FTA) satellite receivers,
commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite
(DBS) receivers, and cable television (CATV) feeds. The mapping between receivers
and recorders is dynamic and modular. All signal routing is performed under computer
control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth
A/V format and are then processed to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the case of North American
English) and to generate automatic speech recognition (ASR) output. An overview of
the system, the sources recorded and the configuration of the recording laboratory
is contained in the Guidelines for Broadcast Audio Collection Version 3.0 included
in this release. LDC designed a portable platform for remote broadcast collection.
This is a TiVo-style digital video recording (DVR) system that records two streams
of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S
satellite programming and can operate outside of the United States. It has a small
footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
HKUST collected Chinese broadcast programming using its internal recording system
and a portable broadcast collection platform designed by LDC and installed at HKUST
in 2006. *Data* The broadcast conversation recordings in this release feature interviews,
call-in programs, and roundtable discussions focusing principally on current events
from the following sources: Anhui TV, a regional television station in Anhui Province,
China; Beijing TV, a national television station in China; China Central TV (CCTV),
a Chinese national and international broadcaster; Hubei TV, a regional broadcaster
in Hubei Province, China; and Phoenix TV, a Hong Kong-based satellite television station.
This release contains 217 audio files presented in FLAC-compressed Waveform Audio
File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by
a native Chinese speaker following Audit Procedure Specification Version 2.0 which
is included in this release. The broadcast auditing process served three principal
goals: as a check on the operation of the broadcast collection system equipment by
identifying failed, incomplete or faulty recordings, as an indicator of broadcast
schedule changes by identifying instances when the incorrect program was recorded,
and as a guide for data selection by retaining information about a program’s genre,
data type and topic.
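The stated audio parameters give a rough idea of the uncompressed size of the collection. The sketch below assumes the approximate figure of 126 hours from this record and ignores file headers; the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # 16000 Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
HOURS = 126               # approximate duration stated in this record

# Raw PCM data rate and total size before FLAC compression.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
total_bytes = bytes_per_second * HOURS * 3600
print(f"{total_bytes / 1e9:.1f} GB of raw PCM")
```

The FLAC files in the release will be smaller than this estimate, since FLAC is a lossless compression format.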
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2014 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637009
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2014T28
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 795-890-089-376-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2014]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2014T28
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 126
hours of Chinese broadcast conversation speech collected in 2007 by LDC and Hong Kong University
of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding audio data is released as
GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 (LDC2014S09). The source
broadcast conversation recordings feature interviews, call-in programs and roundtable
discussions focusing principally on current events from the following sources: Anhui
TV, a regional television station in Anhui Province, China; Beijing TV, a national
television station in China; China Central TV (CCTV), a Chinese national and international
broadcaster; Hubei TV, a regional television station in Hubei Province, China; and
Phoenix TV, a Hong Kong-based satellite television station. *Data* The transcript
files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 1,556,904 tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription
tool that supports manual transcription and annotation of audio recordings. XTrans
is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick
(near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
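As a rough illustration of the conventions described in the summary above, the following Python sketch sorts transcript files by guideline from their names and reads tab-delimited, UTF-8 text into rows. The file names and sample rows are hypothetical, and no column layout is assumed because the summary does not describe one; the documentation included with the release is the authority on both.

```python
import csv
import io


def transcription_style(filename: str) -> str:
    """Return 'QRTR', 'QTR', or 'unknown' based on the marker in the filename."""
    upper = filename.upper()
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return "unknown"


def read_tdf(stream) -> list[list[str]]:
    """Read tab-delimited text into rows; columns are left uninterpreted."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical file names, used only to exercise the helper.
names = ["PROGRAM_A.qtr.tdf", "PROGRAM_B.qrtr.tdf", "README.txt"]
styles = {name: transcription_style(name) for name in names}

# In-memory stand-in for a UTF-8 TDF file; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample = io.StringIO("file_a\t0\t你好\nfile_a\t0\t谢谢\n")
rows = read_tdf(sample)
```

Quoting is disabled in the reader so that quotation marks inside transcribed speech are kept as ordinary characters rather than treated as field delimiters.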
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2014T28
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637017
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 969-347-223-333-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cat
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SenSem (Sentence Semantics) Databank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications
Inter-University Research Group that includes the following Spanish institutions:
the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat
de Lleida and the Universitat Oberta de Catalunya. It contains syntactic and semantic
annotation for over 35,000 sentences, approximately one million words of Spanish and
approximately 700,000 words of Catalan translated from the Spanish. GRIAL's work focuses
on resources for applied linguistics, including lexicography, translation and natural
language processing. Each sentence in SenSem Databank was labeled according to the
verb sense it exemplifies, the type of complement it takes (arguments or adjuncts)
and the syntactic category and function. Each argument was also labeled with a semantic
role. Further information about the SenSem project can be obtained from the GRIAL
website at http://grial.uab.es/sensem/corpus. *Data* The Spanish source data includes
texts from news journals (30,000 sentences) and novels (5,299 sentences). Those sentences
represent around 1,000 different verb meanings that correspond to the 250 most frequent
Spanish verbs. Verb frequencies were retrieved from a quantitative analysis of around
13 million words. The Catalan corpus was developed by translating the news journal
portion of the Spanish data set, resulting in a resource of over 700,000 sentences
from which 391,267 sentences were annotated. Sentences were automatically translated
and manually post-edited; some were re-annotated for sentence complements. Semantic
information was the same for both languages. The Catalan sentences represent close
to 1,300 different verbs. Data is presented in a single XML file per language.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish and Catalan. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fernández, Ana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vázquez, Gloria
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637025
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 686-404-990-337-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast News Transcripts Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 170 hours of Arabic
broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC,
Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation)
program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News
Speech Part 2 (LDC2015S01). The broadcast recordings used for transcription feature
news programs focusing principally on current events from the following sources: Abu
Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News
Channel, based in Iran; Aljazeera, a regional broadcaster located in Doha, Qatar;
Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, based in Dubai, United
Arab Emirates; Al Iraqiyah, a television network based in Iraq; Kuwait TV, a national
television station based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese
television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national
television station based in Saudi Arabia; and Syria TV, the national television station
in Syria. LDC also previously released GALE Phase 2 Arabic Broadcast News Transcripts
Part 1 (LDC2014T17). *Data* The transcript files are in plain-text, tab-delimited
format (TDF) with UTF-8 encoding, and the transcribed data totals 920,730 tokens.
The transcripts were created with the LDC-developed transcription tool, XTrans, a
multi-platform, multilingual, multi-channel transcription tool that supports manual
transcription and annotation of audio recordings. XTrans is available from the following
link, https://www.ldc.upenn.edu/language-resources/tools/xtrans. The files in this
corpus were transcribed by LDC staff and/or by transcription vendors under contract
to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick
rich transcription specification (QRTR), both of which are included in the documentation
with this release. QTR transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional mark-up. It does not
include sentence unit annotation. QRTR annotation adds structural information such
as topic boundaries and manual sentence unit annotation to the core components of
a quick transcript. Files with QTR as part of the filename were developed using QTR
transcription. Files with QRTR in the filename indicate QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637033
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 598-796-027-937-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 2 Arabic Broadcast News Speech Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by the Linguistic Data
Consortium (LDC) and is comprised of approximately 170 hours of Arabic broadcast news
speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco
during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts
Part 2 (LDC2015T01). LDC also released GALE Phase 2 Arabic Broadcast News Speech Part
1 (LDC2014S07). Broadcast audio for the GALE program was collected at LDC’s Philadelphia,
PA USA facilities and at three remote collection sites: Hong Kong University of Science
and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat,
Morocco) (Arabic). The combined local and outsourced broadcast collection supported
GALE at a rate of approximately 300 hours per week of programming from more than 50
broadcast sources for a total of over 30,000 hours of collected broadcast audio over
the life of the program. LDC’s local broadcast collection system is highly automated,
easily extensible and robust, and is capable of collecting, processing and evaluating
hundreds of hours of content from several dozen sources per day. The broadcast material
is served to the system by a set of free-to-air (FTA) satellite receivers, commercial
direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers,
and cable television (CATV) feeds. The mapping between receivers and recorders is
dynamic and modular. All signal routing is performed under computer control, using
a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and
are then processed to extract audio, to generate keyframes and compressed audio/video,
to produce time-synchronized closed captions (in the case of North American English)
and to generate automatic speech recognition (ASR) output. An overview of the system,
the sources recorded and the configuration of the recording laboratory are contained
in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVo-style
digital video recording (DVR) system that records two streams of A/V material simultaneously.
It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can
operate outside of the United States. It has a small footprint, weighs less than 30
pounds and can be transported as carry-on luggage. Medianet collected Arabic programming
from across the Gulf region using its internal system and LDC's portable broadcast
collection platform installed in 2008. The portable platform deployed at the Medianet
Tunisian collection facility collected multiple streams of regional Arabic programming
from various sources. MTC collected Arabic programming using its internal collection
system. *Data* The broadcast recordings in this release feature news programs focusing
principally on current events from the following sources: Abu Dhabi TV, a television
station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran;
Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national
broadcast station in Jordan; Dubai TV, based in Dubai, United Arab Emirates; Al Iraqiyah,
a television network based in Iraq; Kuwait TV, a national television station based
in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile
TV, a broadcast programmer based in Egypt; Saudi TV, a national television station
based in Saudi Arabia; and Syria TV, the national television station in Syria. This
release contains 204 audio files presented in FLAC-compressed Waveform Audio File
format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native
Arabic speaker following Audit Procedure Specification Version 2.0 which is included
in this release. The broadcast auditing process served three principal goals: as a
check on the operation of the broadcast collection system equipment by identifying
failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes
by identifying instances when the incorrect program was recorded; and as a guide for
data selection by retaining information about a program’s genre, data type and topic.
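The figures in the summary above allow a back-of-the-envelope estimate of the decoded audio volume. The Python sketch below is only that: it treats "approximately 170 hours" as exactly 170 and ignores file headers, so the results are estimates rather than properties of the release.

```python
SAMPLE_RATE_HZ = 16_000   # stated in the summary
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel
APPROX_HOURS = 170        # "approximately 170 hours", taken as exact here
FILE_COUNT = 204          # stated in the summary

total_seconds = APPROX_HOURS * 3600
pcm_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
mean_minutes_per_file = total_seconds / FILE_COUNT / 60

print(f"uncompressed PCM: about {pcm_bytes / 1e9:.1f} GB")
print(f"mean file length: about {mean_minutes_per_file:.0f} minutes")
```

Under these assumptions the decoded audio comes to roughly 19.6 GB, or about 50 minutes per file on average. The FLAC files as distributed will be smaller, since FLAC is a lossless compression format.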
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637041
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 102-408-869-995-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Avocado Research Email Collection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Avocado Research Email Collection consists of emails and attachments taken from 279
accounts of a defunct information technology company referred to as "Avocado". Most
of the accounts are those of Avocado employees; the remainder represent shared accounts
such as "Leads", or system accounts such as "Conference Room Upper Canada". The collection
consists of the processed personal folders of these accounts with metadata describing
folder structure, email characteristics and contacts, among others. It is expected
to be useful for social network analysis, e-discovery and related fields. *Data* The
source data for the collection consisted of Personal Storage Table (PST) files for
282 accounts. A PST file is used by MS Outlook to store emails, calendar entries,
contact details, and related information. Data was extracted from the PST files using
libpst version 0.6.54. Three files produced no output and are not included in
the collection. Each account is referred to as a "custodian" although some of the
accounts do not correspond to humans. The collection is divided into metadata and
text. The metadata is represented in XML, with a single top-level XML file listing
the custodians, and then one XML file per custodian listing all items extracted from
that custodian's PST files. The full XML tree can be read by loading the top-level
file with an XML parser that handles directives. All XML metadata files are encoded
in UTF-8. The text contains the extracted text of the items in the custodians' folders,
with the extracted text for each item being held in a separate file. The text files
are then zipped into a zip file per custodian. *Licensing* Users are required to sign
two license agreements in order to access this corpus, the Avocado Collection Organizational
License Agreement and the Avocado Collection End User Agreement. Those agreements
can be viewed in the License field of this catalog entry.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oard, Douglas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Webber, William
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kirsch, David A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Golitsynskiy, Sergey
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 919-947-994-062-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed
by the Linguistic Data Consortium (LDC) and contains 242,020 tokens of word aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 (LDC2014T25)
*Data* This release consists of Chinese source broadcast conversation (BC) and broadcast
news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre,
words, character tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  Segments
Chinese   BC     92     67,354   101,032     2,714
Chinese   BN     34     93,992   140,988     3,314
Total            126    161,346  242,020     6,028
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters. The Chinese word alignment
tasks consisted of the following components:
* Identifying, aligning, and tagging eight different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
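The counts in the summary above can be cross-checked with a few lines of Python. This is a minimal sketch that copies the figures from the record; it confirms that the genre rows add up to the stated totals and that the character counts follow the stated rate of 1.5 characters per word.

```python
# Rows copied from the table in the summary:
# (genre, files, words, char_tokens, segments)
rows = [
    ("BC", 92, 67_354, 101_032, 2_714),
    ("BN", 34, 93_992, 140_988, 3_314),
]
stated_total = (126, 161_346, 242_020, 6_028)

# Sum each numeric column across the two genres.
computed_total = tuple(sum(row[i] for row in rows) for i in range(1, 5))

# Characters per word for each genre, to compare against the stated 1.5.
chars_per_word = {genre: chars / words for genre, _, words, chars, _ in rows}

print(computed_total == stated_total)
print(chars_per_word)
```

The character-token total is the same 242,020 figure quoted at the start of the summary.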
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u sem d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637068
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 648-931-730-249-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
sem
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
pus
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ajp
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
pus
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RATS Speech Activity Detection
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
RATS Speech Activity Detection was developed by the Linguistic Data Consortium (LDC)
and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi,
Pashto, and Urdu conversational telephone speech with automatic and manual annotation
of speech segments. The corpus was created to provide training, development and initial
test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic
Transcription of Speech) program. The goal of the RATS program was to develop human
language technology systems capable of performing speech detection, language identification,
speaker identification and keyword spotting on the severely degraded audio signals
that are typical of various radio communication channels, especially those employing
various types of handheld portable transceiver systems. To support that goal, LDC
assembled a system for the transmission, reception and digital capture of audio data
that allowed a single source audio signal to be distributed and recorded over eight
distinct transceiver configurations simultaneously. Those configurations included
three frequencies -- high, very high and ultra high -- variously combined with amplitude
modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band
or wide-band frequency modulation. Annotations on the clear source audio signal, e.g.,
time boundaries for the duration of speech activity, were projected onto the corresponding
eight channels recorded from the radio receivers. *Data* The source audio consists
of conversational telephone speech recordings collected by LDC: (1) data collected
for the RATS program from Levantine Arabic, Farsi, Pashto and Urdu speakers; and (2)
material from the Fisher English (LDC2004S13, LDC2005S13), and Fisher Levantine Arabic
telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01). Annotation
was performed in three steps. LDC's automatic speech activity detector was run against
the audio data to produce a speech segmentation for each file. Manual first pass annotation
was then performed as a quick correction of the automatic speech activity detection
output. Finally, in a manual second pass annotation step, annotators reviewed first
pass output and made adjustments to segments as needed. All audio files are presented
as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression
is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file
headers.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in South Levantine Arabic, North Levantine Arabic, English, Persian, Pushto,
and Urdu. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sessa, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637092
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 594-468-772-379-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Mandarin-English Code-Switching in South-East Asia
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological
University and Universiti Sains Malaysia in Singapore and Malaysia, respectively.
It is comprised of approximately 192 hours of Mandarin-English code-switching speech
from 156 speakers with associated transcripts. Code-switching refers to the practice
of shifting between languages or language varieties during conversation. This corpus
focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers.
Speakers engaged in unscripted conversations and interviews. In the conversational
speech segments, two speakers conversed freely with each other. The interviews consisted
of questions from an interviewer and answers from an interviewee; only the interviewee's
speech was recorded. Topics discussed include hobbies, friends, and daily activities.
*Data* The speakers were gender-balanced (49.7% female, 50.3% male) and between 19
and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.
The speech recordings were conducted in a quiet room using several microphones and
recording devices. Details about the recording conditions are contained in the documentation
provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings
in FLAC-compressed WAV format, between 20 and 120 minutes in length. Selected segments
of the audio recordings were transcribed. Most of those segments contain code-switching
utterances. The transcription file for each audio file is stored in UTF-8 tab-separated
text file format.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nanyang Technological University
ADDED ENTRY--PERSONAL NAME
- Personal name:
Universiti Sains Malaysia
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637076
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 140-461-846-522-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by the
Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this
release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Modern Standard Arabic source
text and corresponding English translations selected from broadcast conversation data
collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under
its direction. LDC has also released the following GALE Phase 1 & 2 Arabic Parallel
Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
*Data* GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text includes 55
source-translation document pairs, comprising 280,535 words of Arabic source text
and its English translation. Data is drawn from 22 distinct Arabic programs broadcast
between 2006 and 2008 from Al Alam News Channel, based in Iran; Al Arabiya, a news
television station based in Dubai; Al Baghdadya, an Iraqi broadcaster; Al Fayhaa,
a television channel in Iraq; Al Hiwar TV, based in London, United Kingdom; Aljazeera,
a regional broadcaster located in Doha, Qatar; Bahrain TV, based in the Kingdom of
Bahrain; Nile TV, a broadcast programmer based in Egypt; Oman TV, a national broadcaster
located in the Sultanate of Oman; Saudi TV, a national television station based in
Saudi Arabia; and Syria TV, the national television station in Syria. Broadcast conversation
programming is generally more interactive than traditional news broadcasts and includes
talk shows, interviews, call-in programs and roundtables. The files in this release
were transcribed by LDC staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the text. Data was manually
selected for translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented files were
then reformatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
*Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
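The tab-delimited, one-segment-per-row layout described above can be read with a generic tab-separated parser. The following is a minimal sketch that assumes nothing about column names or order (those are defined in TDF_format.txt, which is not reproduced here); the helper name `read_tdf_rows` and the two-field sample rows are hypothetical.

```python
import csv
import io


def read_tdf_rows(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks inside transcribed text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical two-field sample; real TDF files carry more fields per segment.
sample = io.StringIO("doc1\tfirst segment text\ndoc1\tsecond segment text\n")
rows = list(read_tdf_rows(sample))
```

For a file on disk, the stream would typically be opened with `encoding="utf-8"` and `newline=""`, since the release states that all data are UTF-8.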
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637084
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 041-146-278-187-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Parallel Aligned Treebank -- Training
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Parallel Aligned Treebank -- Training was developed by the Linguistic
Data Consortium (LDC) and contains 229,249 tokens of word aligned Chinese and English
parallel text with treebank annotations. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program. Parallel aligned
treebanks are treebanks annotated with morphological and syntactic structures aligned
at the sentence level and the sub-sentence level. Such data sets are useful for natural
language processing and related fields, including automatic word alignment system
training and evaluation, transfer-rule extraction, word sense disambiguation, translation
lexicon extraction and cultural heritage and cross-linguistic studies. With respect
to machine translation system development, parallel aligned treebanks may improve
system performance with enhanced syntactic parsers, better rules and knowledge about
language pairs and reduced word error rate. The Chinese source data was translated
into English. Chinese and English treebank annotations were performed independently.
The parallel texts were then word aligned. The material in this release corresponds
to portions of the Chinese treebanked data in Chinese Treebank 6.0 (LDC2007T36) (CTB),
OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03). *Data* This release consists
of Chinese source broadcast programming (China Central TV, Phoenix TV), newswire (Xinhua
News Agency) and web data collected by LDC. The distribution by genre, words, character
tokens, treebank tokens and segments appears below:

Genre  Files  Words    CharTokens  CTBTokens  Segments
bc     10     57,571   86,356      60,270     3,328
nw     172    64,337   96,505      57,722     2,092
wb     86     30,925   46,388      31,240     1,321
Total  268    152,833  229,249     149,232    6,741

Note that all token
counts are based on the Chinese data only. One token is equivalent to one character
and one word is equivalent to 1.5 characters. The Chinese word alignment task consisted
of the following components:
* Identifying, aligning, and tagging eight different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
This release contains nine types of files: Chinese raw source files, English
raw translation files, Chinese character tokenized files, Chinese CTB tokenized files,
English tokenized files, Chinese treebank files, English treebank files, character-based
word alignment files, and CTB-based word alignment files.
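The stated equivalence of one word to 1.5 characters can be checked against the per-genre figures in the summary above. This is a small illustrative sketch; the variable and function names are assumptions, and the counts are copied from the summary rather than computed from the corpus.

```python
# Word and character-token counts per genre, copied from the summary above.
counts = {
    "bc": {"words": 57_571, "char_tokens": 86_356},
    "nw": {"words": 64_337, "char_tokens": 96_505},
    "wb": {"words": 30_925, "char_tokens": 46_388},
}

CHARS_PER_WORD = 1.5  # ratio stated in the summary


def estimated_words(char_tokens):
    """Estimate a word count from a character-token count using the stated ratio."""
    return char_tokens / CHARS_PER_WORD


# Totals across the three genres, for comparison with the "Total" figures.
total_chars = sum(c["char_tokens"] for c in counts.values())
total_words = sum(c["words"] for c in counts.values())
```

Each estimate lands within one word of the listed figure, which suggests the word counts were derived from the character counts rather than from a separate segmentation.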
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Chinese, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitch
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637106
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 567-512-470-543-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Mandarin Chinese Phonetic Segmentation and Tone
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data
Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" and their phonetic
segmentation and tone labels separated into training and test sets. The utterances
were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73
and LDC98T24, respectively). That collection consists of approximately 30 hours of
Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM,
a commercial radio station based in Los Angeles, CA. The ability to use large speech
corpora for research in phonetics, sociolinguistics and psychology, among other fields,
depends on the availability of phonetic segmentation and transcriptions. This corpus
was developed to investigate the use of phone boundary models on forced alignment
in Mandarin Chinese. Using the approach of embedded tone modeling (also used for incorporating
tones for automatic speech recognition), the performance on forced alignment between
tone-dependent and tone-independent models was compared. *Data* Utterances were considered
as the time-stamped between-pause units in the transcribed news recordings. Those
with background noise, music, unidentified speakers and accented speakers were excluded.
A test set was developed with 300 utterances randomly selected from six speakers (50
utterances for each speaker). The remaining 7,549 utterances formed a training set.
The utterances in the test set were manually labeled and segmented into initials and
finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones
were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral
tone. The Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented
and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM)
aligner trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1%
agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation.
The quality of the phonetic transcription and tone labels of the training set was
evaluated by checking 100 utterances randomly selected from it. The 100 utterances
contained 1,252 syllables: 15 syllables had mistaken tone transcriptions; two syllables
showed mistaken transcriptions of the final, and there were no syllables with transcription
errors on the initial. Each utterance has three associated files: a flac compressed
wav file, a word transcript file, and a phonetic boundaries and label file.
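An agreement figure such as "within 20 ms" can be sketched as the fraction of aligner boundaries that fall within a tolerance of the matching manual boundary. The sketch below assumes the two boundary lists are already paired one-to-one; that pairing, the function name, and the sample times are illustrative assumptions and not necessarily how the reported 93.1% was computed.

```python
def boundary_agreement(predicted, reference, tolerance=0.020):
    """Return the fraction of paired boundaries that differ by at most `tolerance` seconds."""
    if len(predicted) != len(reference):
        raise ValueError("boundary lists must be paired one-to-one")
    if not predicted:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= tolerance)
    return hits / len(predicted)


# Hypothetical boundary times in seconds, for illustration only.
manual = [0.10, 0.25, 0.40, 0.62]
aligned = [0.11, 0.24, 0.45, 0.62]
score = boundary_agreement(aligned, manual)  # third boundary is 50 ms off
```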
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Yuan, Jiahong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ryant, Neville
ADDED ENTRY--PERSONAL NAME
- Personal name:
Liberman, Mark
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637114
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 022-705-777-770-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The Subglottal Resonances Database
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Subglottal Resonances Database was developed by Washington University and University
of California Los Angeles and consists of 45 hours of simultaneous microphone and
subglottal accelerometer recordings of 25 adult male and 25 adult female speakers
of American English between 22 and 25 years of age. The subglottal system is composed
of the airways of the tracheobronchial tree and the surrounding tissues. It powers
airflow through the larynx and vocal tract, allowing for the generation of most of
the sound sources used in languages around the world. The subglottal resonances (SGRs)
are the natural frequencies of the subglottal system. During speech, the subglottal
system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured
from recordings of the vibration of the skin of the neck during phonation by an accelerometer,
much like speech formants are measured through microphone recordings. SGRs have received
attention in studies of speech production, perception and technology. They affect
voice production, divide vowels and consonants into discrete categories, affect vowel
perception and can be useful in automatic speech recognition. *Data* Speakers were
recruited by Washington University's Psychology Department. The majority of the participants
were Washington University students who represented a wide range of American English
dialects, although most were speakers of the mid-American English dialect. The corpus
consists of 35 monosyllables in a phonetically neutral carrier phrase (“I said a ____
again”), with 10 repetitions of each word by each speaker, resulting in 17,500 individual
microphone (and accelerometer) waveforms. The monosyllables comprised 14 hVd
words and 21 CVb words, where C was b, d or g and V included all AE monophthongs and diphthongs.
The target vowel in each utterance was hand-labeled to indicate the start, stop, and
steady-state parts of the vowel. For diphthongs, the steady-state refers to the diphthong
nucleus, which occurs early in the vowel. The height and age of each speaker are included
in the corpus metadata. Audio files are presented as single-channel, 16-bit FLAC-compressed
WAV files with sample rates of 48 kHz or 16 kHz. Image files are bitmap image files
and plain text is UTF-8.
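The waveform count quoted above follows directly from the recording design. A short sketch of the arithmetic, using only figures from the summary (the variable names are illustrative):

```python
# Design figures from the summary above.
hvd_words = 14        # hVd monosyllables
cvb_words = 21        # CVb monosyllables
repetitions = 10      # repetitions of each word per speaker
speakers = 25 + 25    # adult male + adult female speakers

monosyllables = hvd_words + cvb_words
waveforms = monosyllables * repetitions * speakers
```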
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alwan, Abeer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lulich, Steven M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sommers, Mitchell S.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637122
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 622-675-389-924-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Arabic Broadcast News Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Arabic Broadcast News Parallel Text was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source text and
corresponding English translations selected from broadcast news data collected
by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.
LDC has also released the following GALE Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
* GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text (LDC2015T05)
*Data* GALE Phase 3 and 4 Arabic Broadcast News Parallel
Text includes 86 source-translation document pairs, comprising 325,538 words of Arabic
source text and its English translation. Data is drawn from 28 distinct Arabic programs
broadcast between 2007 and 2008 from Abu Dhabi TV, a television station based in Abu
Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news
television station based in Dubai; Al Baghdadya, an Iraqi broadcaster; Alhurra, a
U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station;
Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national
broadcast station in Jordan; Al Sharqiya, an Iraqi broadcast programmer; Dubai TV,
a broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station
based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station;
Oman TV, a national broadcaster located in the Sultanate of Oman; Radio Sawa, a U.S.
government-funded regional broadcaster; Saudi TV, a national television station based in Saudi
Arabia; and Syria TV, the national television station in Syria. Broadcast news programming
consists of news programs focusing principally on current events. The files in this
release were transcribed by LDC staff and/or transcription vendors under contract
to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC.
Transcribers indicated sentence boundaries in addition to transcribing the text. Data
was manually selected for translation according to several criteria, including linguistic
features, transcription features and topic features. The transcribed and segmented
files were then reformatted into a human-readable translation format and assigned
to translation vendors. Translators followed LDC's Arabic to English translation guidelines.
Bilingual LDC staff performed quality control procedures on the completed translations.
Source data and translations are distributed in TDF format. TDF files are tab-delimited
files containing one segment of text along with meta information about that segment.
Each field in the TDF file is described in TDF_format.txt. All data are encoded in
UTF-8. *Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
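The TDF description above (tab-delimited, one text segment per row plus metadata, UTF-8) can be illustrated with a minimal reading sketch. The field names, the 'name;type' header convention and the ';;' comment prefix used here are assumptions for illustration only; the authoritative field list is the TDF_format.txt file shipped with the release.

```python
import io


def parse_tdf(stream):
    """Parse tab-delimited segment rows into dicts keyed by header name."""
    header = None
    rows = []
    for raw in stream:
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and ';;' comment lines carry no segment data.
        if not line or line.startswith(";;"):
            continue
        cells = line.split("\t")
        if header is None:
            # Assumption: header cells may look like 'name;type'; keep the name.
            header = [cell.split(";", 1)[0] for cell in cells]
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical sample; real files define their own fields in TDF_format.txt.
sample = (
    "file;unicode\tstart;float\tend;float\ttranscript;unicode\n"
    ";; hypothetical comment line\n"
    "example_doc\t0.0\t2.5\tfirst segment\n"
    "example_doc\t2.5\t4.0\tsecond segment\n"
)
segments = parse_tdf(io.StringIO(sample))
```

For a file on disk, the same function could be fed open(path, encoding="utf-8"), since the release states that all data are UTF-8 encoded.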
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u cat d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637130
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 524-236-409-783-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SenSem (Sentence Semantics) Lexicons
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications
Inter-University Research Group that includes the following Spanish institutions:
the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat
de Lleida and the Universitat Oberta de Catalunya. It contains feature descriptions
for approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the SenSem Databank
(LDC2015T02). GRIAL's work focuses on resources for applied linguistics, including
lexicography, translation and natural language processing. *Data* The verb features
for each language consist of two groups: those codified manually, including definition,
WordNet synset, Aktionsart, arguments and semantic functions; and those extracted
automatically from the SenSem Databank. Among the latter are verb frequency, semantic
construction, syntactic categories and constituent order. The verbs analyzed correspond
to the 250 most frequent verbs in Spanish and 320 lemmas in Catalan. Further information
about the SenSem project can be obtained from the GRIAL website at http://grial.uab.es/sensem/corpus.
Data is presented in a single XML file per language.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Catalan and Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fernández, Ana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vázquez, Gloria
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637149
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 060-785-139-403-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Coordination Annotation for the Penn Treebank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Coordination Annotation for the Penn Treebank is a stand-off annotation for the Wall
Street Journal portion of Treebank-3 (PTB3) (LDC99T42) developed by researchers at
the University of Düsseldorf and Indiana University. It marks all tokens that have
a coordinating function (potentially among other functions). Coordination is a syntactic
structure that links together two or more elements known as conjuncts or conjoins.
The presence of coordination is often signaled by the appearance of a coordinator
(coordinating conjunction), such as "and", "or" or "but" in English. Penn Coordination Annotation
is available at no cost to all licensees of PTB3 and appears in their download queue
associated with LDC99T42 as penn_coordination_anno_LDC2015T08.tgz. *Data* This annotation
is presented in a single UTF-8 plain text tsv file with columns as follows:
* section: Penn Treebank WSJ section number
* file: Number of file within section
* sentence: Number of sentence (starting with 0)
* token: Number of token (starting with 0)
* annotation: "P" if the token is a coordinating punctuation, "O" otherwise
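The five-column layout described above can be read with a short sketch such as the following. It assumes the columns appear in the listed order and that any row whose first field is not numeric (for example a header row, if one exists) should be skipped; the sample rows are hypothetical and not taken from the corpus.

```python
import csv
import io
from collections import namedtuple

# Column names follow the order given in the summary above.
Row = namedtuple("Row", "section file sentence token annotation")


def read_annotations(stream):
    """Read stand-off rows: section, file, sentence, token, annotation."""
    rows = []
    for rec in csv.reader(stream, delimiter="\t"):
        # Assumption: skip malformed rows and any non-numeric header row.
        if len(rec) != 5 or not rec[0].isdigit():
            continue
        section, file_no, sentence, token, label = rec
        rows.append(Row(int(section), int(file_no), int(sentence), int(token), label))
    return rows


# Hypothetical sample rows for illustration only.
sample = "00\t1\t0\t3\tO\n00\t1\t0\t4\tP\n02\t15\t7\t9\tP\n"
rows = read_annotations(io.StringIO(sample))

# Positions of tokens marked "P"; sentence and token indices start at 0.
coordinating = [
    (r.section, r.file, r.sentence, r.token) for r in rows if r.annotation == "P"
]
```

Because the annotation is stand-off, each position tuple would still need to be resolved against the corresponding PTB3 sentence and token to recover the actual word or punctuation mark.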
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kübler, Sandra
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maier, Wolfgang
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hinrichs, Erhard
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637157
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 355-531-564-384-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 112
hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and
Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the
DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio
data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (LDC2015S06).
Part 1 of this release is GALE Phase 3 Chinese Broadcast Conversation Transcripts
Part 1 (LDC2014T28). The corresponding part one audio is released as GALE Phase 3
Chinese Broadcast Conversation Speech Part 1 (LDC2014S09). The broadcast conversation
recordings feature interviews, call-in programs and roundtable discussions focusing
principally on current events from the following sources: Beijing TV, a national television
station in Mainland China; China Central TV, a national and international broadcaster
in Mainland China; Hubei TV, a regional television station in Mainland China, Hubei
Province; Phoenix TV, a Hong Kong-based satellite television station; and Voice of
America, a U.S. government-funded broadcast programmer. *Data* The transcript files
are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 1,388,236 tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription
tool that supports manual transcription and annotation of audio recordings. XTrans
is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR in the filename were developed
using QTR transcription; files with QRTR in the filename were developed using QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637165
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 319-611-553-017-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by the Linguistic
Data Consortium (LDC) and comprises approximately 112 hours of Mandarin Chinese
broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University
of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding transcripts are released
as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09). Part
1 of this release is GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 (LDC2014S09).
The corresponding part one transcripts are released as GALE Phase 3 Chinese Broadcast
Conversation Transcripts Part 1 (LDC2014T28). Broadcast audio for the GALE program
was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection
sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco)
(Arabic). The combined local and outsourced broadcast collection supported GALE at
a rate of approximately 300 hours per week of programming from more than 50 broadcast
sources for a total of over 30,000 hours of collected broadcast audio over the life
of the program. LDC’s local broadcast collection system is highly automated, easily
extensible, robust and capable of collecting, processing and evaluating hundreds
of hours of content from several dozen sources per day. The broadcast material is
served to the system by a set of free-to-air (FTA) satellite receivers, commercial
direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers,
and cable television (CATV) feeds. The mapping between receivers and recorders is
dynamic and modular. All signal routing is performed under computer control, using
a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and
are then processed to extract audio, to generate keyframes and compressed audio/video,
to produce time-synchronized closed captions (in the case of North American English)
and to generate automatic speech recognition (ASR) output. An overview of the system,
the sources recorded and the configuration of the recording laboratory are contained
in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.
LDC designed a portable platform for remote broadcast collection. This is a TiVO-style
digital video recording (DVR) system that records two streams of A/V material simultaneously.
It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can
operate outside of the United States. It has a small footprint, weighs less than 30
pounds and can be transported as carry-on luggage. HKUST collected Chinese broadcast
programming using its internal recording system and a portable broadcast collection
platform designed by LDC and installed at HKUST in 2006. *Data* The broadcast conversation
recordings in this release feature interviews, call-in programs, and roundtable discussions
focusing principally on current events from the following sources: Beijing TV, a national
television station in Mainland China; China Central TV, a national and international
broadcaster in Mainland China; Hubei TV, a regional television station in Mainland
China, Hubei Province; Phoenix TV, a Hong Kong-based satellite television station;
and Voice of America, a U.S. government-funded broadcast programmer. This release
contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac),
16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker
following Audit Procedure Specification Version 2.0 which is included in this release.
The broadcast auditing process served three principal goals: as a check on the operation
of the broadcast collection system equipment by identifying failed, incomplete or
faulty recordings, as an indicator of broadcast schedule changes by identifying instances
when the incorrect program was recorded, and as a guide for data selection by retaining
information about a program’s genre, data type and topic.
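As a rough worked calculation based on the audio format stated above (16000 Hz, single-channel, 16-bit PCM) and the approximate total of 112 hours, the uncompressed size of the audio can be estimated as below. This is an estimate only: the hour count is approximate, and the distributed FLAC-compressed files will be smaller.

```python
SAMPLE_RATE_HZ = 16000   # from the release description
BITS_PER_SAMPLE = 16     # from the release description
CHANNELS = 1             # single-channel


def pcm_bytes(seconds):
    """Uncompressed PCM payload size in bytes for the given duration."""
    return seconds * SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS


bytes_per_second = pcm_bytes(1)        # 32000
bytes_per_hour = pcm_bytes(3600)       # 115200000
total_bytes = pcm_bytes(112 * 3600)    # 12902400000, about 12.9 GB decoded
```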
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637203
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 838-468-581-053-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CIEMPIESS
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería
Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of
the Faculty of Engineering at the National Autonomous University of Mexico (UNAM)
and consists of approximately 18 hours of Mexican Spanish radio speech, associated
transcripts, pronouncing dictionaries and language models. The goal of this work was
to create acoustic models for automatic speech recognition. For more information and
documentation see the CIEMPIESS-UNAM Project website. LDC has also released an updated
version as CIEMPIESS LIGHT (LDC2017S23). *Data* The speech recordings are from 43
one-hour FM radio programs broadcast by Radio IUS, a UNAM radio station. They
consist of spontaneous conversations between a radio moderator and guests, principally
about legal issues. Approximately 78% of the speakers were male, and 22% of the speakers
were female. The audio was recorded in MP3 stereo format, using a 44.1 kHz sample
rate and a bit-rate of 128 kbps or higher. Only "clean" utterances were selected from
the raw data, meaning that the utterances were made by only one person with no background
noises, whispers, music, foreign accents, white noise or static. The audio files were
converted to 16 kHz, 16-bit PCM WAV format for this release. The recordings were transcribed
using PRAAT, a tool designed for phonetics research. The transcripts are in Mexbet,
a phonetic alphabet designed for Mexican Spanish based on Worldbet (Hieronymus, 1994).
Plain text transcripts, textgrid format time labels and files useful for performing
experiments with the SPHINX3 recognition software are also included.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mena, Carlos Daniel Hernández
ADDED ENTRY--PERSONAL NAME
- Personal name:
Herrera, Abel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 256-234-245-630-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RST Signalling Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
RST Signalling Corpus was developed at Simon Fraser University and contains annotations
for signalling information added to RST Discourse Treebank (LDC2002T07). RST Discourse
Treebank (RST-DT) is a collection of English news texts annotated for rhetorical relations
under the RST (Rhetorical Structure Theory) framework. In RST Signalling Corpus, information
about textual signals -- such as although, because, thus -- and signals such as tense,
lexical chains or punctuation was added as an annotation layer to examine how rhetorical
relations are signalled in discourse. *Data* The source data consists of 385 Wall
Street Journal news articles from the Penn Treebank annotated for rhetorical relations
in RST Discourse Treebank. As in RST-DT, the data in this release is divided into
a training set (347 articles) and a test set (38 articles). The signalling annotation
in this data set was performed using the UAM CorpusTool version 2.8.12. Files are
presented as UTF-8 encoded XML and plain text. The corpus is divided into three annotation
sub-directories: training, test and full. All sub-directories include source, metadata,
signalling annotation, and dtd files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Das, Debopam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taboada, Maite
ADDED ENTRY--PERSONAL NAME
- Personal name:
McFetridge, Paul
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u bul d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637173
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 578-227-532-044-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
bul
- Language code of text/sound track or separate title:
dan
- Language code of text/sound track or separate title:
dut
- Language code of text/sound track or separate title:
ger
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
slv
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
bul
- Language code of text/sound track or separate title:
dan
- Language code of text/sound track or separate title:
nld
- Language code of text/sound track or separate title:
deu
- Language code of text/sound track or separate title:
jpn
- Language code of text/sound track or separate title:
por
- Language code of text/sound track or separate title:
slv
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
swe
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 CoNLL Shared Task - Ten Languages
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages
used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The
languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese,
Portuguese, Slovene, Spanish, Swedish and Turkish. LDC also released the following
2006 & 2007 CoNLL Shared Task corpora: * 2007 CoNLL Shared Task - Basque, Catalan,
Czech & Turkish (LDC2018T06) * 2007 CoNLL Shared Task - Greek, Hungarian & Italian
(LDC2018T07) * 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12) This
corpus is cross listed and jointly released with ELRA as ELRA-W0086. The Conference
on Computational Natural Language Learning (CoNLL) is accompanied every year by a
shared task intended to promote natural language processing applications and evaluate
them in a standard setting. In 2006, the shared task was devoted to the parsing of
syntactic dependencies using corpora from up to thirteen languages. The task aimed
to define and extend the then-current state of the art in dependency parsing, a technology
that complemented previous tasks by producing a different kind of syntactic description
of input text. More information about the 2006 shared task is available on the CoNLL-X
web page. LDC has released data sets from other CoNLL shared tasks. 2008 CoNLL Shared
Task Data contains the English material used in the 2008 shared task which focused
on English, employed a unified dependency-based formalism and merged the tasks of
syntactic dependency parsing, identifying semantic arguments and labeling them with
semantic roles. 2009 CoNLL Shared Task Data Parts 1 and 2 consist of the English,
Catalan, Chinese, Czech, German and Spanish resources used in the 2009 task which
included a comparison of time and space complexity based on participants' input and
learning curve comparison for languages with large datasets. LDC has also released
the following CoNLL Shared Task data sets: * 2006 CoNLL Shared Task - Arabic & Czech
(LDC2015T12) * 2008 CoNLL Shared Task Data (LDC2009T12) * 2009 CoNLL Shared Task Part
1 (LDC2012T03) * 2009 CoNLL Shared Task Part 2 (LDC2012T04) * 2015-2016 CoNLL Shared
Task (LDC2017T13) *Data* The source data in the treebanks in this release consists
principally of various texts (e.g., textbooks, news, literature) annotated in dependency
format. In general, dependency grammar is based on the idea that the verb is the center
of the clause structure and that other units in the sentence are connected to the
verb as directed links or dependencies. This is a one-to-one correspondence: for every
element in the sentence there is one node in the sentence structure that corresponds
to that element. In constituency or phrase structure grammars, on the other hand,
clauses are divided into noun phrases and verb phrases and in each sentence, one or
more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example
of a constituency or phrase structure approach. All of the data sets in this release
are dependency treebanks. The individual data sets are: * BulTreeBank (Bulgarian)
* The Danish Dependency Treebank (Danish) * The Alpino Treebank (Dutch) * The TIGER
Corpus (German) * Treebank Tuba-J/S (Japanese) * Floresta Sinta(c)tica (Portuguese)
* Slovene Dependency Treebank, SDT V0.1 (Slovene) * Cast3LB (Spanish) * Talbanken05
(Swedish) * METU-Sabanci Turkish Treebank (Turkish)
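The one-head-per-token idea described above can be illustrated with a minimal sketch. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the release documentation should be checked for the exact layout of each treebank. The English sample sentence, its tags, and the helper names are hypothetical and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """The fields of one token that describe the dependency structure."""
    id: int       # 1-based position of the token in the sentence
    form: str     # surface word form
    head: int     # id of the governing token; 0 marks the root
    deprel: str   # label of the dependency relation to the head


def parse_conllx_sentence(block: str) -> List[Token]:
    """Parse one sentence given as tab-separated, ten-column lines."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Columns 1, 2, 7 and 8 (1-based) are ID, FORM, HEAD and DEPREL.
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


def roots(tokens: List[Token]) -> List[str]:
    """Return the forms of all tokens attached to the artificial root (0)."""
    return [t.form for t in tokens if t.head == 0]


# Hypothetical sample: every token has exactly one head, and the verb is the root.
SAMPLE = "\n".join([
    "1\tShe\t_\tPRON\tPRON\t_\t2\tSUBJ\t_\t_",
    "2\treads\t_\tVERB\tVERB\t_\t0\tROOT\t_\t_",
    "3\tbooks\t_\tNOUN\tNOUN\t_\t2\tOBJ\t_\t_",
])

if __name__ == "__main__":
    for tok in parse_conllx_sentence(SAMPLE):
        print(tok.id, tok.form, "<-", tok.head, tok.deprel)
```

Each token row carries a single HEAD value, which is what makes the structure a one-to-one mapping between sentence elements and nodes.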
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovenian, Spanish,
Swedish, and Turkish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bulgarian Academy of Sciences
ADDED ENTRY--PERSONAL NAME
- Personal name:
Eberhard-Karls-Universität
ADDED ENTRY--PERSONAL NAME
- Personal name:
Copenhagen Business School
ADDED ENTRY--PERSONAL NAME
- Personal name:
Danish Society for Language and Literature
ADDED ENTRY--PERSONAL NAME
- Personal name:
University of Groningen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Universität Potsdam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Universität des Saarlandes
ADDED ENTRY--PERSONAL NAME
- Personal name:
Universität Stuttgart
ADDED ENTRY--PERSONAL NAME
- Personal name:
Eberhard-Karls-Universität Tübingen
ADDED ENTRY--PERSONAL NAME
- Personal name:
University of Southern Denmark
ADDED ENTRY--PERSONAL NAME
- Personal name:
SINTEF Telcom & Informatics
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jožef Stefan Institute
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charles University
ADDED ENTRY--PERSONAL NAME
- Personal name:
The Fran Ramovš Institute for the Slovenian Language
ADDED ENTRY--PERSONAL NAME
- Personal name:
University of Barcelona
ADDED ENTRY--PERSONAL NAME
- Personal name:
Uppsala University
ADDED ENTRY--PERSONAL NAME
- Personal name:
Växjö University
ADDED ENTRY--PERSONAL NAME
- Personal name:
Middle East Technical University
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u cze d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637181
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 798-485-294-792-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2006 CoNLL Shared Task - Arabic & Czech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks
used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. LDC
also released the following 2006 & 2007 CoNLL Shared Task corpora: * 2007 CoNLL Shared
Task - Basque, Catalan, Czech & Turkish (LDC2018T06) * 2007 CoNLL Shared Task - Greek,
Hungarian & Italian (LDC2018T07) * 2006 CoNLL Shared Task - Ten Languages (LDC2015T11) This
corpus is cross listed with ELRA as ELRA-W0087. The Conference on Computational Natural
Language Learning (CoNLL) is accompanied every year by a shared task intended to promote
natural language processing applications and evaluate them in a standard setting.
In 2006, the shared task was devoted to the parsing of syntactic dependencies using
corpora from up to thirteen languages. The task aimed to define and extend the then-current
state of the art in dependency parsing, a technology that complemented previous tasks
by producing a different kind of syntactic description of input text. More information
about the 2006 shared task is available on the CoNLL-X web page. LDC has released
data sets from other CoNLL shared tasks. 2008 CoNLL Shared Task Data contains the
English material used in the 2008 shared task which focused on English, employed a
unified dependency-based formalism and merged the tasks of syntactic dependency parsing,
identifying semantic arguments and labeling them with semantic roles. 2009 CoNLL Shared
Task Data Parts 1 and 2 consist of the English, Catalan, Chinese, Czech, German and
Spanish resources used in the 2009 task which included a comparison of time and space
complexity based on participants' input and learning curve comparison for languages
with large datasets. LDC has also released the following CoNLL Shared Task data sets:
* 2006 CoNLL Shared Task - Ten Languages (LDC2015T11) * 2008 CoNLL Shared Task Data
(LDC2009T12) * 2009 CoNLL Shared Task Part 1 (LDC2012T03) * 2009 CoNLL Shared Task
Part 2 (LDC2012T04) * 2015-2016 CoNLL Shared Task (LDC2017T13) *Data* The source data
in this release consists principally of news and journal texts. The individual data
sets are subsets of the following: * Prague Arabic Dependency Treebank (PADT) 1.0
* The Prague Dependency Treebank 1.0
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Czech and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charles University
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637246
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T13
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English News Text Treebank: Penn Treebank Revised
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English News Text Treebank: Penn Treebank Revised was developed by the Linguistic
Data Consortium (LDC) with funding through a gift from Google Inc. It consists of
a combination of automated and manual revisions of the Penn Treebank annotation of
Wall Street Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens
in 49,191 sentence-level tokens, covering all 2,312 of the original Penn Treebank WSJ
files. *Data* This release includes revised tokenization, part-of-speech, and syntactic
treebank annotation intended to bring the full WSJ treebank section into compliance
with the agreed-upon policies and updates implemented for current English treebank
annotation specifications at LDC. Examples of corpora that follow those specifications include English Web Treebank (LDC2012T13),
OntoNotes (LDC2013T19), and English translation treebanks such as English Translation
Treebank: An-Nahar Newswire (LDC2012T02). English Treebank Supplemental Guidelines
are included in this release.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mott, Justin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Warner, Colin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637211
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 382-448-237-694-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences was developed by the
Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this
release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source sentences and corresponding
English translations selected from broadcast conversation data collected by LDC in
2008 and transcribed and translated by LDC or under its direction. *Data* GALE Phase
4 Chinese Broadcast Conversation Parallel Sentences includes 109 source-translation
document pairs, comprising 63,829 tokens of Chinese source text and its English translation.
Data is drawn from 17 distinct Chinese programs broadcast in 2008 from Beijing TV,
a national television station in Mainland China; China Central TV, a national and
international broadcaster in Mainland China; Hubei TV, a regional television station
in Hubei Province, Mainland China; and Voice of America, a U.S. government-funded
broadcast programmer. Broadcast conversation programming is more interactive than
traditional news broadcasts and includes talk shows, interviews, call-in programs
and roundtable discussions. The programs in this release focus on current events topics.
The data was transcribed by LDC staff and/or transcription vendors under contract
to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC.
Transcribers indicated sentence boundaries in addition to transcribing the text. Sentences
were selected for translation in two steps. First, files were chosen using sentence
selection scripts provided by GALE program participants SRI International and IBM.
The output was then manually reviewed by LDC staff to eliminate problematic sentences.
Selected files were reformatted into a human-readable translation format and assigned
to translation vendors. Translators followed LDC's Chinese to English translation
guidelines and were provided with the full source documents containing the target
sentences for their reference. Bilingual LDC staff performed quality control procedures
on the completed translations. Source data and translations are distributed in TDF
format. TDF files are tab-delimited files containing one segment of text along with
meta information about that segment. Each field in the TDF file is described in TDF_format.txt.
All data are encoded in UTF-8. *Acknowledgement* This work was supported in part by
the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S08
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The Walking Around Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The Walking Around Corpus was developed by Stony Brook University and comprises
approximately 33 hours of navigational telephone dialogues from 72 speakers (36
speaker pairs). Participants were Stony Brook University students who identified themselves
as native English speakers. This corpus was elicited using a navigation task in which
one person directed another to walk to 18 unique destinations on Stony Brook University’s
West campus. The direction-giver remained inside the lab and gave directions on a
landline telephone to the pedestrian who used a mobile phone. As they visited each
location, the pedestrians took a picture of each of the 18 destinations using the
mobile phone. Pairs conversed spontaneously as they completed the task. The pedestrians'
locations were tracked using their cell phones' GPS systems. The pedestrians did not
have any maps or pictures of the target destinations and therefore relied on the direction-giver's
verbal directions and descriptions to locate and photograph the target destinations.
*Data* The conversations were recorded by means of a Public Switched Telephone Network
(PSTN) conferencing service. Due to the nature of the task, the recordings contain
occasional background noise. Each digital audio file was transcribed with time stamps.
Most of the recordings were first transcribed by a transcription company and then
edited and checked by a trained graduate student. The remaining recordings were transcribed
by trained students at Stony Brook University. The corpus material also includes the
visual materials (pictures and maps) used to elicit the dialogues, data about the
speakers' relationship, spatial abilities and memory performance, and other information.
All audio is presented as 8000 Hz, 16-bit FLAC-compressed WAV. Note that the data was
converted from WAV, so some documentation may still refer to WAV files. Transcripts are
presented as XLS spreadsheets.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brennan, Susan E.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schuhmann, Katharina S.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Batres, Karla M.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637238
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T15
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TS Wikipedia
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia
pages. The data is tokenized and includes part-of-speech tags, morphological analysis,
lemmas, bi-grams and tri-grams. *Data* The data is in a word-per-line format with
five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma
and corrected token spelling if needed. All data is presented in UTF-8 XML files and
was selected and filtered to reduce non-Turkish characters, mathematical formulas
and non-Turkish entries.
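As a rough illustration of the word-per-line layout described above, the following sketch splits one line into the five named columns. It assumes that the fifth column (corrected spelling) may be empty or absent when no correction applies; that detail, the sample line, its tag values, and the names `TsToken` and `parse_line` are hypothetical and should be checked against the release documentation.

```python
from typing import NamedTuple, Optional


class TsToken(NamedTuple):
    """One word-per-line record with the five columns named in the summary."""
    token: str
    pos: str
    morph: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_line(line: str) -> TsToken:
    """Split a tab-separated line into its columns.

    Assumption: the correction column is either missing or empty when the
    token needs no correction, so both four and five columns are accepted.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) not in (4, 5):
        raise ValueError(f"expected 4 or 5 columns, got {len(cols)}: {line!r}")
    corrected = cols[4] if len(cols) == 5 and cols[4] else None
    return TsToken(cols[0], cols[1], cols[2], cols[3], corrected)


if __name__ == "__main__":
    # Hypothetical line; the tag and analysis strings are placeholders.
    print(parse_line("evler\tNoun\tNoun+A3pl\tev\n"))
```

Because the files are described as XML, the lines would first need to be extracted from the enclosing elements before a function like this is applied.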
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Turkish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sezer, Taner
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sezer, Türker
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637254
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 147-667-579-524-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast News Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast News Parallel Sentences was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source sentences and corresponding
English translations selected from broadcast news data collected by LDC in 2008 and
transcribed and translated by LDC or under its direction. *Data* GALE Phase 4 Chinese
Broadcast News Parallel Sentences includes 40 source-translation document pairs, comprising
156,429 tokens of Chinese source text and its English translation. Data is drawn from
eight distinct Chinese programs broadcast in 2008 from China Central TV, a national
and international broadcaster in Mainland China; and Voice of America, a U.S. government-funded
broadcast programmer. The programs in this release are news broadcasts covering current
events. The data was transcribed by LDC staff and/or transcription vendors
under contract to LDC in accordance with the Quick Rich Transcription guidelines developed
by LDC. Transcribers indicated sentence boundaries in addition to transcribing the
text. Sentences were selected for translation in two steps. First, files were chosen
using sentence selection scripts provided by GALE program participants SRI International
and IBM. The output was then manually reviewed by LDC staff to eliminate problematic
sentences. Selected files were reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's Chinese to English
translation guidelines and were provided with the full source documents containing
the target sentences for their reference. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.txt. All data are encoded in UTF-8. *Acknowledgement* This work was
supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637262
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 868-101-922-037-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LDC Spoken Language Sampler - Third Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains
samples from 20 different corpora published by LDC between 1996 and 2015. LDC distributes
a wide and growing assortment of resources for researchers, engineers and educators
whose work is concerned with human languages. Historically, most linguistic resources
were not generally available to interested researchers but were restricted to single
laboratories or to a limited number of users. Inspired by the success of selected
readily-available and well-known data sets, such as the Brown University text corpus,
LDC was founded in 1992 to provide a new mechanism for large-scale corpus development
and resource sharing. With the support of its members, LDC provides critical services
to the language research community that include: maintaining the LDC data archives,
producing and distributing data via media or web download, negotiating intellectual
property agreements with potential information providers and maintaining relations
with other like-minded groups around the world. Resources available from LDC include
speech, text, video data and lexicons in multiple languages, as well as software tools
to facilitate the use of corpus materials. For a complete view of LDC's publications,
browse the Catalog. The sampler is available as a free download. *Data* The LDC Spoken
Language Sampler - Third Release provides speech and transcript samples and is designed
to illustrate the variety and breadth of the speech-related resources available from
the LDC Catalog. The sound files included in this release are excerpts that have been
modified in various ways relative to the original data as published by LDC:
* Most excerpts are truncated to be much shorter than the original files, typically between
1.5 and 2 minutes.
* Signal amplitude has been adjusted where necessary to normalize playback volume.
* Some corpora are published in compressed form, but all samples here are uncompressed.
* Some text files are presented as images to ensure foreign character sets display properly.
* In some publications, NIST SPHERE file format is used for audio data, but the audio files
in this sampler are MS-WAV/audio (RIFF) file format for compatibility with typical browser
audio utilities. FLAC files have been expanded into their wav form as well.
The link for the catalog number takes you to
the catalog entry. LDC2014S06 2009 NIST Language Recognition Evaluation Test Set The
2009 evaluation contains approximately 215 hours of conversational telephone speech
and radio broadcast conversation collected by LDC in the following 23 languages and
dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American),
English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto,
Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese. LDC2014S01
CALLFRIEND Farsi Second Edition Speech CALLFRIEND Farsi Second Edition Speech was
developed by LDC and consists of approximately 42 hours of telephone conversation
(100 recordings) among native Farsi speakers. The CALLFRIEND project supported the
development of language identification technology. Each CALLFRIEND corpus consists
of unscripted telephone conversations lasting between 5 and 30 minutes. LDC96S37 CALLHOME
Japanese A corpus of 120 unscripted telephone conversations between native Japanese
speakers and a corpus of associated transcripts. LDC2013S09 CSC Deceptive Speech CSC
Deceptive Speech was developed by Columbia University, SRI International and University
of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers
of Standard American English (16 male, 16 female) recruited from the Columbia University
student population and the community. The purpose of the study was to distinguish
deceptive speech from non-deceptive speech using machine learning techniques on extracted
features from the corpus. LDC2007S18 CSLU Kids' Speech Developed at Oregon State University's
Center for Spoken Language Understanding, this corpus is a collection of spontaneous
and prompted speech from 1100 children from Kindergarten through Grade 10. LDC2010S01
Fisher Spanish Speech Fisher Spanish Speech consists of audio files covering roughly
163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean
Spanish speakers. LDC2014S02 King Saud University Arabic Speech Database King Saud
University Arabic Speech Database contains 590 hours of recorded Arabic speech from
269 male and female Saudi and non-Saudi speakers. The utterances include read and
spontaneous speech recorded in quiet and noisy environments. The recordings were collected
via different microphones and a mobile phone and averaged between 16 and 19 minutes. LDC2003S07
Korean Telephone Conversations Complete The Korean telephone conversations were originally
recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech
consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND
Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations
Transcripts (LDC2003T08) consists of 100 text files, totaling approximately 190K words
and 25K unique words. All files are in Korean orthography: orthographic Korean characters
are in Hangul, encoded in KSC5601 (Wansung) system. The complete data set also includes
a lexicon (LDC2003L02). LDC2012S04 Malto Speech and Transcripts Malto Speech and Transcripts
contains approximately 8 hours of Malto speech data collected between 2005 and 2009
from 27 speakers (22 males, 5 females). Also included are accompanying transcripts,
English translations and glosses for 6 hours of the collection. Malto is principally
spoken in northeastern India and Bangladesh. LDC2015S05 Mandarin Chinese Phonetic
Segmentation and Tone Mandarin Chinese Phonetic Segmentation and Tone was developed
by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation
and tone labels separated into training and test sets. The utterances were derived
from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24,
respectively). That collection consists of approximately 30 hours of Chinese broadcast
news recordings from Voice of America, China Central TV and KAZN-AM, a commercial
radio station based in Los Angeles, CA. This corpus was developed to investigate the
use of phone boundary models on forced alignment in Mandarin Chinese. LDC2015S04 Mandarin-English
Code-Switching in South-East Asia Mandarin-English Code-Switching in South-East Asia
was developed by Nanyang Technological University and Universiti Sains Malaysia and
includes approximately 192 hours of Mandarin-English code-switching speech from 156
speakers with associated transcripts. LDC2013S03 Mixer 6 Speech Mixer 6 Speech was
developed by LDC and is comprised of 15,863 hours of telephone speech, interviews
and transcript readings from 594 distinct native English speakers. This material was
collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase
6, the focus of which was on native American English speakers local to the Philadelphia
area. LDC2014S03 Multi-Channel WSJ Audio Multi-Channel WSJ Audio was developed by
the Centre for Speech Technology Research at The University of Edinburgh and contains
approximately 100 hours of recorded speech from 45 British English speakers. Participants
read Wall Street Journal texts published in 1987-1989 in three recording scenarios:
a single stationary speaker, two stationary overlapping speakers and one single moving
speaker. LDC2004S09 NIST Meeting Pilot Corpus Speech This data set contains speech
and transcriptions from topical discussions in meeting settings, including complete
descriptive metadata and detailed descriptions of the physical environment in which
the discussions took place. LDC2015S02 RATS Speech Activity Detection RATS Speech
Activity Detection was developed by LDC and is comprised of approximately 3,000 hours
of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech
with automatic and manual annotation of speech segments. The corpus was created to
provide training, development and initial test sets for the Speech Activity Detection
(SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. LDC2015S03
The Subglottal Resonances Database The Subglottal Resonances Database was developed
by Washington University and University of California Los Angeles and consists of
45 hours of simultaneous microphone and subglottal accelerometer recordings of 25
adult male and 25 adult female speakers of American English between 22 and 25 years
of age. LDC2012S02 TORGO Database of Dysarthric Articulation TORGO contains approximately
23 hours of English speech data, accompanying transcripts and documentation from 8
speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis
and from 7 speakers (4 males, 3 females) from a non-dysarthric control group. LDC2012S06
Turkish Broadcast News Speech and Transcripts Turkish Broadcast News Speech and Transcripts
contains approximately 130 hours of Voice of America Turkish radio broadcasts and
corresponding transcripts. LDC2014S08 United Nations Proceedings Speech United Nations
Proceedings Speech was developed by the United Nations (UN) and contains approximately
8,500 hours of recorded proceedings in the six official UN languages, Arabic, Chinese,
English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions
64-66 of the General Assembly and First Committee (Disarmament and International Security),
and meetings 6434-6763 of the Security Council. LDC2014S04 USC-SFI MALACH Interviews
and Transcripts Czech USC-SFI MALACH Interviews and Transcripts Czech was developed
by The University of Southern California Shoah Foundation Institute (USC-SFI) and
the University of West Bohemia as part of the MALACH (Multilingual Access to Large
Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420
interviewees along with transcripts and other documentation.
LANGUAGE NOTE
- Language note:
Content in . Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637270
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 568-308-670-444-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Learner Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Learner Corpus was developed at the University of Leeds and consists of written
essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012
and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students
from 67 nationalities studying at pre-university and university levels. The average
length of an essay is 178 words. *Data* Two tasks were used to collect the written
data, and participants had the choice to do one or both of them. In each of those
tasks, learners were asked to write a narrative about a vacation trip and a discussion
about the participant's study interest. Those choosing the first task generated a
40 minute timed essay without the use of any language reference materials. In the
second task, participants completed the writing as a take-home assignment over two
days and were permitted to use language reference materials. The audio recordings
were developed by allowing students a limited amount of time to talk about the topics
above without using language reference materials. The original handwritten essays
were transcribed into an electronic text format. The corpus data consists of three
types: (1) handwritten sheets scanned in PDF format; (2) audio recordings in MP3 format;
and (3) textual Unicode data in plain text and XML formats (including the transcribed
audio and transcripts of the handwritten essays). The audio files are either 44100 Hz
2-channel or 16000 Hz 1-channel MP3 files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alfaifi, Abdullah
ADDED ENTRY--PERSONAL NAME
- Personal name:
Atwell, Eric
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637289
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 583-709-024-480-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast Conversation Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic
Data Consortium (LDC) and is comprised of approximately 123 hours of Arabic broadcast
conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat,
Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation)
program. Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation
Transcripts Part 1 (LDC2015T16). Broadcast audio for the GALE program was collected
at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong
Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. LDC’s local broadcast collection system
is highly automated, easily extensible and robust and capable of collecting, processing
and evaluating hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast
satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular. All signal routing is performed under
computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high
bandwidth A/V format and are then processed to extract audio, to generate keyframes
and compressed audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition (ASR) output.
An overview of the system, the sources recorded and the configuration of the recording
laboratory are contained in the Guidelines for Broadcast Audio Collection Version
3.0 included in this release. LDC designed a portable platform for remote broadcast
collection. This is a TiVo-style digital video recording (DVR) system that records
two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL)
and FTA DVB-S satellite programming and can operate outside of the United States.
It has a small footprint, weighs less than 30 pounds and can be transported as carry-on
luggage. Medianet collected Arabic programming from across the Gulf region using its
internal system and LDC's portable broadcast collection platform installed in 2008.
The portable platform deployed at the Medianet Tunisian collection facility collected
multiple streams of regional Arabic programming from various sources. MTC collected
Arabic programming using its internal collection system. *Data* The broadcast conversation
recordings in this release feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources: Abu Dhabi TV, a
television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel,
based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a
regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station
in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting
Corporation, a Lebanese television station; Oman TV, a national broadcaster located
in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia;
and Syria TV, the national television station in Syria. This release contains 149
audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000
Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following
Audit Procedure Specification Version 2.0 which is included in this release. The broadcast
auditing process served three principal goals: as a check on the operation of the
broadcast collection system equipment by identifying failed, incomplete or faulty
recordings; as an indicator of broadcast schedule changes by identifying instances
when the incorrect program was recorded; and as a guide for data selection by retaining
information about a program’s genre, data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637297
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 591-679-153-987-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet,
Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
3 Arabic Broadcast Conversation Speech Part 1 (LDC2015S11). The broadcast conversation
recordings for transcription feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources: Abu Dhabi TV, a
television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel,
based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a
regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station
in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting
Corporation, a Lebanese television station; Oman TV, a national broadcaster located
in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia;
and Syria TV, the national television station in Syria. *Data* The transcript files
are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 733,233 tokens. The transcripts were created with the LDC-developed transcription
tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings. XTrans is available
from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637300
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 000-414-019-590-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ACE 2007 Spanish DevTest - Pilot Evaluation
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ACE 2007 Spanish DevTest was developed by the Linguistic Data Consortium (LDC). This
publication contains the complete set of Spanish development and test data to support
the 2007 Automatic Content Extraction (ACE) technology evaluation, namely, newswire
data annotated for entities and temporal expressions. The objective of the ACE program
was to develop automatic content extraction technology to support automatic processing
of human language in text form from a variety of sources including newswire, broadcast
programming and weblogs. In the 2007 evaluation, participants were tested on system
performance for the recognition of entities, values, temporal expressions, relations,
and events in Chinese and English and for the recognition of entities and temporal
expressions in Arabic and Spanish. LDC's work in the ACE program is described in more
detail on the LDC ACE project pages. LDC has also released ACE 2007 Multilingual Training
Corpus (LDC2014T18) which contains the Arabic and Spanish training data used in the
2007 evaluation. *Data* The data consists of newswire material published in May 2005
from the following sources: Agence France Presse, The Associated Press and Xinhua News
Agency. All files were annotated by two human annotators working independently. Discrepancies
between the two annotations were adjudicated by a senior team member resulting in
a gold standard file. There are three annotation directories for each newswire story
that contain an identical copy of the source text in SGML format and two associated
annotated versions in XML format and tab-delimited format. All text is UTF-8 encoded.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637319
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 793-803-205-712-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
por
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NewSoMe Corpus of Opinion in News Reports
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists
of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part
of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations
across several genres and covering multiple languages. NewSoMe is the result of an
effort to build a unifying annotation framework for analyzing opinion in different
genres, ranging from controlled text, such as news reports, to diverse types of user-generated
content that includes blogs, product reviews and microblogs. *Data* The source data
in this release was obtained from various newspaper websites and consists of approximately
200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried
out manually through the crowdsourcing platform CrowdFlower with seven annotations
per layer that were aggregated for this data set. The layers annotated were topic,
segment, cue, subjectivity, polarity and intensity. Data is presented as UTF-8 either
as plain text or in CSV files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish, Catalan, and Portuguese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Domingo, Judith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Badia, Toni
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637327
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 161-465-005-066-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed
by the Linguistic Data Consortium (LDC) and contains 243,038 tokens of word-aligned
Chinese and English parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous Language Exploitation)
program. Some approaches to statistical machine translation include the incorporation
of linguistic knowledge in word-aligned text as a means to improve automatic word
alignment and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation approaches.
A set of word tags and alignment link tags are designed in the tagging scheme to describe
these translation units and relations. Tagging adds contextual, syntactic and language-specific
features to the alignment annotation. Other releases available in this series are:
* GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
* GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
* GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
* GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 (LDC2014T25)
* GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 (LDC2015T04)
*Data* This release consists of Chinese source broadcast conversation (BC) and broadcast
news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre,
words, character tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  Segments
Chinese   BC     69     67,782   101,674     2,276
Chinese   BN     29     94,242   141,364     3,152
Total            98     162,024  243,038     5,428
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters. The Chinese word alignment
tasks consisted of the following components:
* Identifying, aligning, and tagging eight different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
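The stated ratio of 1.5 characters per word can be checked against the counts in the distribution table. The following is a minimal sketch, not part of the LDC release; the names COUNTS, implied_words and max_rounding_gap are illustrative assumptions, and the figures are copied from the summary above.

```python
# Counts copied from the distribution table in the summary:
# genre -> (words, character tokens).
COUNTS = {
    "BC": (67_782, 101_674),
    "BN": (94_242, 141_364),
    "Total": (162_024, 243_038),
}
CHARS_PER_WORD = 1.5  # ratio stated in the summary


def implied_words(char_tokens, chars_per_word=CHARS_PER_WORD):
    """Estimate a word count from a character-token count."""
    return char_tokens / chars_per_word


def max_rounding_gap(counts=COUNTS):
    """Largest difference between listed and implied word counts."""
    return max(abs(implied_words(chars) - words) for words, chars in counts.values())


if __name__ == "__main__":
    for genre, (words, chars) in COUNTS.items():
        print(genre, words, round(implied_words(chars), 2))
```

Under these assumptions the listed word counts agree with the character counts to within two words per row, which suggests the word figures were derived from the character totals rather than counted separately.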
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637335
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 859-613-244-400-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Arabic Newswire Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and corresponding
English translations selected from newswire data collected by LDC in 2007 and 2008
and transcribed and translated by LDC or under its direction. LDC has also released
the following GALE Arabic Parallel Text data sets:
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24)
* GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (LDC2008T09)
* GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03)
* GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (LDC2009T09)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (LDC2012T06)
* GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (LDC2012T14)
* GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17)
* GALE Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18)
* GALE Phase 2 Arabic Web Parallel Text (LDC2013T01)
* GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text (LDC2015T05)
* GALE Phase 3 and 4 Arabic Broadcast News Parallel Text (LDC2015T07)
*Data* GALE Phase 3 and 4 Arabic Newswire Parallel Text includes 551
source-translation document pairs, comprising 156,775 tokens of Arabic source text
and its English translation. Data is drawn from seven distinct Arabic newswire sources:
Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat
and Assabah. The files in this release were transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines
developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing
the text. Data was manually selected for translation according to several criteria,
including linguistic features, transcription features and topic features. The transcribed
and segmented files were then reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's Arabic to English
translation guidelines. Bilingual LDC staff performed quality control procedures on
the completed translations. Source data and translations are distributed in TDF format.
TDF files are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.txt. All
data are encoded in UTF-8. *Acknowledgement* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
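Because TDF files are tab-delimited with one segment per line, they can be read with a generic tab-separated parser. The sketch below is an illustration only: the field names in HYPOTHETICAL_FIELDS and the sample rows are invented for the example, and the real field list should be taken from TDF_format.txt in the release.

```python
import csv
import io


def read_tab_delimited(stream, field_names):
    """Yield one dict per non-empty line, pairing tab-separated values with field_names."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield dict(zip(field_names, row))


# Hypothetical field names and sample data, NOT the actual TDF layout.
HYPOTHETICAL_FIELDS = ["source_file", "segment_id", "text"]
sample = io.StringIO("doc_001\t1\tfirst segment\ndoc_001\t2\tsecond segment\n")
segments = list(read_tab_delimited(sample, HYPOTHETICAL_FIELDS))
print(len(segments), segments[0]["text"])
```

For a real file, the stream would be opened with encoding="utf-8", since the summary states that all data are UTF-8 encoded.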
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637343
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 840-846-753-370-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Karlsruhe Children's Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Karlsruhe Children's Text was developed by the Cooperative State University Baden-Württemberg,
University of Education and Karlsruhe Institute of Technology. It consists of over
14,000 freely written German sentences from more than 1,700 schoolchildren in grades
one through eight. The data collection was conducted in 2011-2013 at elementary and
secondary schools in and around Karlsruhe, Germany. Students were asked to write as
verbose a text as possible. Those in grades one to four were read two stories and
were then asked to write their own stories. Students in grades five through eight
were instructed to write on a specific theme, such as "Imagine the world in 20 years.
What has changed?" The goal of the collection was to use the data to develop a spelling
error classification system. *Data* Annotators converted the handwritten text into
digital form with all errors committed by the writers; they also created an orthographically
correct version of every sentence. Metadata about the text was gathered, including
the circumstances under which it was collected, information about the student writer
and background about spelling lessons in the particular class. In a second step, the
students' spelling errors were annotated into general groupings: grapheme level, syllable
level, morphology and syntax. The files were anonymized in a third step. This release
also contains metadata regarding the writers’ language biography, teaching methodology,
age, gender and school year. The average age of the participants was 11 years, and
the gender distribution was nearly equal. Original handwriting is presented as JPEG
format image files and the converted annotated text as UTF-8 plain text. Metadata
is contained within each text file.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fay, Johanna
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637351
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 607-221-014-735-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Articulation Index LSCP
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives
et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a
subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English
syllables. Changes include the addition of forced alignment to sound files, time alignment
of syllable utterances and format conversions. AIC consists of 20 American English
speakers (12 males, 8 females) pronouncing syllables, some of which form actual words,
but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant
(VC) combinations were recorded for each speaker twice, once in isolation and once
within a carrier sentence, for a total of 25,768 recorded syllables. *Data* Articulation
Index LSCP alters AIC in the following ways:
* Time alignments for the onset and offset of each word and syllable were generated
  through forced alignment with a standard HMM-GMM (Hidden Markov Model-Gaussian
  Mixture Model) ASR system.
* The time alignments for the beginning and end of the syllables (whether in isolation
  or within a carrier sentence) were manually adjusted. The time alignments for the
  other words in carrier sentences were not manually adjusted.
* The recordings of isolated syllables were cut according to the manual time alignments
  to remove the silent portions at the beginning and end, and the time alignments were
  altered to correspond to the cut recordings.
* The file naming scheme was slightly altered for compatibility with the Kaldi speech
  recognition toolkit.
* AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law)
  version of the recordings, distributed in SPHERE format. The LSCP version contains
  only the wide-band version, distributed as wave files.
This release does not include certain AIC triphone recordings (CVC, CCV or VCC). Audio
data is presented as 16 kHz, 16-bit FLAC-compressed .wav files. The FLAC compression was
added for distribution, and documentation may refer to the files as .wav files.
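The trimming step described above (cutting an isolated-syllable recording at its manual time alignments and shifting the alignments to match) can be sketched as follows. This is an assumption-laden illustration, not the procedure used by the corpus authors; the function name trim_to_alignment and its parameters are hypothetical, and samples are modeled as a plain Python list.

```python
def trim_to_alignment(samples, sample_rate, onset_s, offset_s):
    """Cut samples to [onset_s, offset_s) and return the shifted alignment.

    samples:     sequence of audio samples (any indexable sequence)
    sample_rate: samples per second, e.g. 16000 for the wide-band recordings
    onset_s:     syllable onset in seconds, relative to the original file
    offset_s:    syllable offset in seconds, relative to the original file
    """
    start = round(onset_s * sample_rate)
    end = round(offset_s * sample_rate)
    trimmed = samples[start:end]
    # In the cut recording the syllable starts at time zero and ends at its duration.
    new_onset = 0.0
    new_offset = (end - start) / sample_rate
    return trimmed, new_onset, new_offset


# Toy example: one second of dummy samples at 16 kHz, syllable from 0.25 s to 0.75 s.
dummy = list(range(16000))
cut, onset, offset = trim_to_alignment(dummy, 16000, 0.25, 0.75)
print(len(cut), onset, offset)
```

Rounding to whole samples keeps the new offset consistent with the actual length of the cut recording, which is why the shifted offset is computed from sample indices rather than from the original times.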
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schatz, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cao, Xuan-Nga
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kolesnikova, Anna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bergvelt, Tomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dupoux, Emmanuel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u ara d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 866-063-772-506-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
KHATT: Handwritten Arabic Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum
& Minerals, Technical University of Dortmund and Braunschweig University of Technology.
It comprises scanned Arabic handwriting from 1,000 distinct male and female
writers representing diverse countries, age groups, handedness and education levels.
Participants produced text on a topic of their choice in an unrestricted style. KHATT
was designed to promote research in areas such as text recognition and writer identification.
*Data* The majority of participants were natives of Saudi Arabia; the next largest
group was from a collection of regional countries (Egypt, Jordan, Kuwait, Morocco,
Palestine, Tunisia and Yemen). Most writers were between 16 and 25 years of age with high
school or university qualifications. Scanned text is presented as TIFF images scanned
at 200, 300 and 600 DPI (dots per inch). The source images are four-page TIFFs consisting
of metadata about the writer, fixed paragraphs and free writing. Image files of isolated
paragraphs or lines are also included. Ground-truth files are presented as plain-text
Unicode. Data is divided into training, validation and test sets.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mahmoud, Sabri A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ahmad, Irfan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Al-Khatib, Wasfi G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alshayeb, Mohammad
ADDED ENTRY--PERSONAL NAME
- Personal name:
Parvez, Mohammad Tanvir
ADDED ENTRY--PERSONAL NAME
- Personal name:
Märgner, Volker
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fink, Gernot A.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637378
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 768-351-261-530-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Newswire Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Newswire Parallel Sentences was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Chinese source sentences and corresponding English translations
selected from newswire data collected by LDC in 2008 and translated by LDC or under
its direction. *Data* GALE Phase 4 Chinese Newswire Parallel Sentences includes 627
source-translation document pairs, comprising 90,434 tokens of Chinese source text
and its English translation. Data is drawn from six distinct Chinese newswire sources.
Sentences were selected for translation in two steps. First, files were chosen using
sentence selection scripts provided by GALE program participants SRI International
and IBM. The output was then manually reviewed by LDC staff to eliminate problematic
sentences. Selected files were reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's Chinese-to-English
translation guidelines and were provided with the full source documents containing
the target sentences for their reference. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.txt. All data are encoded in UTF-8. *Acknowledgement* This work was
supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637386
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 029-230-147-739-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast News Transcripts was developed by the Linguistic Data
Consortium (LDC) and contains transcriptions of approximately 150 hours of Chinese
broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science
and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding audio data is released as GALE Phase
3 Chinese Broadcast News Speech (LDC2015S13). The broadcast news recordings for transcription
feature news broadcasts focusing principally on current events from the following
sources: Anhui TV, a regional television station in Mainland China, Anhui Province;
China Central TV (CCTV), a national and international broadcaster in Mainland China;
Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA),
a U.S. government-funded broadcast programmer. *Data* The transcript files are in
plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data
totals 1,933,695 tokens. The transcripts were created with the LDC-developed transcription
tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings. XTrans is available
from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename were developed using QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2015 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637394
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2015S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 249-589-798-222-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Chinese Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2015]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2015S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium
(LDC) and comprises approximately 150 hours of Mandarin Chinese broadcast news
speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology
(HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast
News Transcripts (LDC2015T25). Broadcast audio for the GALE program was collected
at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites: HKUST
(Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic).
The combined local and outsourced broadcast collection supported GALE at a rate of
approximately 300 hours per week of programming from more than 50 broadcast sources
for a total of over 30,000 hours of collected broadcast audio over the life of the
program. LDC’s local broadcast collection system is highly automated, easily extensible
and robust, and is capable of collecting, processing and evaluating hundreds of hours
of content from several dozen sources per day. The broadcast material is served to
the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite
systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable
television (CATV) feeds. The mapping between receivers and recorders is dynamic and
modular. All signal routing is performed under computer control, using a 256x64 A/V
matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized
closed captions (in the case of North American English) and to generate automatic
speech recognition (ASR) output. An overview of the system, the sources recorded and
the configuration of the recording laboratory is contained in the Guidelines for
Broadcast Audio Collection Version 3.0 included in this release. LDC designed a portable
platform for remote broadcast collection. This is a TiVo-style digital video recording
(DVR) system that records two streams of A/V material simultaneously. It supports
analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside
of the United States. It has a small footprint, weighs less than 30 pounds and can
be transported as carry-on luggage. HKUST collected Chinese broadcast programming
using its internal recording system and a portable broadcast collection platform designed
by LDC and installed at HKUST in 2006. *Data* The broadcast news recordings in this
release feature news broadcasts focusing principally on current events from the following
sources: Anhui TV, a regional television station in Mainland China, Anhui Province;
China Central TV (CCTV), a national and international broadcaster in Mainland China;
Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA),
a U.S. government-funded broadcast programmer. This release contains 279 audio files
presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure
Specification Version 2.0 which is included in this release. The broadcast auditing
process served three principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or faulty recordings;
as an indicator of broadcast schedule changes by identifying instances when the incorrect
program was recorded; and as a guide for data selection by retaining information about
a program’s genre, data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2015S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637521
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 145-948-652-558-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
H1 Children's Writing
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
H1 Children's Writing was developed by the Cooperative State University Baden-Württemberg,
University of Education. It consists of 996 texts written over three months by 88
German schoolchildren aged seven through eleven. The data in this corpus was
collected by an elementary school in Baden-Württemberg, Germany and digitized at the
Cooperative State University during the second half of the 2014/2015 school year.
Three second and third grade classrooms participated in the collection. Texts were
written within regular class settings. The students were presented with a picture
and were asked to write a story, to describe the picture or, if unable to write a text,
to list what they saw in the picture. The pictures were designed to enhance the output
with respect to important spelling error categories, namely, the marking of short
vowels with a silent consonant letter and the correct spelling of the long vowel.
The children were allowed at least 15 minutes to write the texts. This exercise was
repeated weekly for 12 weeks. LDC has also released H2, E2, ERK1 Children's Writing
(LDC2018T05). *Data* Most of the participants were multilingual. Out of 85 children
for whom metadata is available, 57 students were multilingual speakers and 28 students
were monolingual German speakers. The following metadata is included for each text
in the database: school week of collection; school type (always elementary school);
age; gender; grade/classroom; language spoken at home; and school materials used for
German (Jojo). In all, 996 texts representing 62,764 tokens were collected. The texts
were digitized in two forms: (1) the original text, including all errors (achieved),
and (2) the intended (target) text, where all spelling errors were removed. Annotations
were added to both the achieved text and the target text to distinguish words that
should not be analyzed for spelling errors, such as names or foreign words. For sentence-level
analysis, syntax errors were annotated by marking substitutions, deletions and insertions
at the word level. In such cases, the used word was analyzed for spelling, and the
correct word was used for sentence structure analysis. Original handwriting is presented
as PDF documents and the converted text as UTF-8 plain text in CSV documents.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Berkling, Kay
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637408
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 805-617-544-777-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
NewSoMe Corpus of Opinion in Blogs
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of
English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News
and Social Media) set of corpora presenting opinion annotations across several genres
and covering multiple languages. NewSoMe is the result of an effort to build a unifying
annotation framework for analyzing opinion in different genres, ranging from controlled
text, such as news reports, to diverse types of user-generated content that includes
blogs, product reviews and microblogs. LDC has also released NewSoMe Corpus of Opinion
in News Reports (LDC2015T17). *Data* The source data in this corpus was obtained by
means of the Google Blog Search API. Spanish blogs were taken from wordpress.com and
blogspot.com blogs. The English data was extracted from those same two domains and
from asiawrites.org. This release consists of 108 English documents and 191 Spanish
documents. The annotation was carried out manually through the crowdsourcing platform
CrowdFlower with seven annotations per layer that were aggregated for this data set.
The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
Data is presented as UTF-8 either as plain text or in CSV files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sauri, Roser
ADDED ENTRY--PERSONAL NAME
- Personal name:
Domingo, Judith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Badia, Toni
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637416
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 479-780-942-548-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Treebank - Weblog
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Treebank - Weblog was developed by the Linguistic Data Consortium (LDC) and
consists of Arabic weblog data with part-of-speech, morphology, gloss and syntactic
tree annotation. The ongoing Penn Arabic Treebank Project (PATB) supports research
in Arabic-language natural language processing and human language technology development.
The methodology and work leading to the release of this publication are described
in detail in the documentation accompanying this corpus. Generally, the PATB consists
of two distinct phases: (a) part-of-speech (POS) tagging, which divides the text into
lexical tokens and gives relevant information about each token such as lexical category,
inflectional features, and a gloss (referred to as POS for convenience, although it
includes morphological and gloss information not traditionally included with part-of-speech
annotation), and (b) Arabic treebanking, which characterizes the constituent structures
of word sequences, provides categories for each non-terminal node, and identifies
null elements, co-reference, traces and so on. *Data* This release contains 243,117
source tokens before clitics were split, and 308,996 tree tokens after clitics were
separated for treebank annotation. The source material is weblogs collected by LDC
from various sources. The data are released as plain-text, TDF and XML files in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maamouri, Mohamed
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kulick, Seth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krouna, Sondos
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tabassi, Dalila
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ciul, Michael
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637424
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 197-343-633-127-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Weblog Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Weblog Parallel Sentences was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Chinese source sentences and corresponding English translations
selected from newsgroup and weblog data collected by LDC and translated by LDC or
under its direction. *Data* GALE Phase 4 Chinese Weblog Parallel Sentences includes
231 source-translation document pairs, comprising 92,501 tokens of Chinese source
text and its English translation. Sentences were selected for translation in two steps.
First, files were chosen using sentence selection scripts provided by GALE program
participants SRI International and IBM. The output was then manually reviewed by LDC
staff to eliminate problematic sentences. Selected files were reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed LDC's
Chinese-to-English translation guidelines and were provided with the full source documents
containing the target sentences for their reference. Bilingual LDC staff performed
quality control procedures on the completed translations. Source data and translations
are distributed in TDF format. TDF files are tab-delimited files containing one segment
of text along with meta information about that segment. Each field in the TDF file
is described in TDF_format.txt. All data are encoded in UTF-8. *Acknowledgement* This
work was supported in part by the Defense Advanced Research Projects Agency, GALE
Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily
reflect the position or the policy of the Government, and no official endorsement
should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637432
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 682-988-480-192-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT Chinese Discussion Forums
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT Chinese Discussion Forums was developed by the Linguistic Data Consortium (LDC)
and consists of 1,597,500 discussion forum threads in Chinese harvested from the Internet
using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational
Language Translation) program developed machine translation and information retrieval
for less formal genres, focusing particularly on user-generated content. LDC supported
the BOLT program by collecting informal data sources -- discussion forums, text messaging
and chat -- in Chinese, Egyptian Arabic and English. The material in this release
represents the unannotated Chinese source data in the discussion forum genre. The
data was subsequently translated and annotated for various tasks in the BOLT program
including word alignment, treebanking, propbanking and co-reference. *Data* Collection
was seeded based on the results of manual data scouting by native speaker annotators.
Scouts were instructed to seek content in Mandarin Chinese that was original, interactive
and informal. Upon locating an appropriate thread, scouts submitted the URL and some
simple judgments about it to a database, via a web browser plug-in. When multiple
threads from a forum were submitted, the entire forum was automatically harvested
and added to the collection. The scale of the collection precluded manual review of
all data. Only a small portion of the threads included in this release were manually
reviewed, and it is expected that there may be some offensive or otherwise undesired
content as well as some threads that contain a large amount of non-Chinese content.
Language identification was performed on all threads in this corpus (using CLD2),
and threads for which the results indicated a high probability of largely non-Chinese
content are listed in cmn_suspect_LID.txt in the docs directory of this package. The
corpus consists of HTML and XML files. Each HTML file is the raw HTML downloaded
from a discussion thread. If the thread spanned multiple URLs, it was stored as
a concatenation of the downloaded HTML files. The XML files were converted from the
raw HTML. *Acknowledgement* This material is based upon work supported by the Defense
Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The
content does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.
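A minimal sketch of excluding threads flagged by language identification, as listed in cmn_suspect_LID.txt. The layout of that file is not described in this record, so the sketch assumes one thread identifier per line (as the first whitespace-separated field) and assumes that data file names begin with the same identifier; both assumptions, and all identifiers shown, are hypothetical.

```python
from pathlib import Path


def load_suspect_ids(lines):
    """Collect the first whitespace-separated field of every non-blank line."""
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:
            ids.add(fields[0])
    return ids


def keep_unflagged(paths, suspect_ids):
    """Return the paths whose file stem is not in the suspect set."""
    return [p for p in paths if Path(p).stem not in suspect_ids]


# Hypothetical identifiers and paths, for illustration only.
suspects = load_suspect_ids(["thread_002 0.12\n", "\n", "thread_005\n"])
kept = keep_unflagged(
    ["data/thread_001.xml", "data/thread_002.xml", "data/thread_005.xml"],
    suspects,
)
print(kept)
```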
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637440
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 503-163-355-082-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 was developed by the
Linguistic Data Consortium (LDC) and contains transcriptions of approximately 129
hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet
(Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
3 Arabic Broadcast Conversation Speech Part 2 (LDC2016S01). The broadcast conversation
recordings feature interviews, call-in programs and roundtable discussions focusing
principally on current events from the following sources: Abu Dhabi TV, a television
station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran;
Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast
programmer based in Egypt; Al Fayha, an Iraqi television channel; Al Hiwar, a regional
broadcast station based in the United Kingdom; Alhurra, a U.S. government-funded regional
broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah,
a national broadcast station in Jordan; Bahrain TV, a television station in the Kingdom
of Bahrain; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV,
a national broadcast station in Kuwait; Oman TV, a national broadcaster located in
the Sultanate of Oman; Qatar TV, a broadcast programmer in Qatar; Saudi TV, a national
television station based in Saudi Arabia; Syria TV, the national television station
in Syria; and Tunisian National TV, a national television station in Tunisia. *Data*
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding,
and the transcribed data totals 845,791 tokens. The transcripts were created with
the LDC tool, XTrans, which supports manual transcription and annotation of audio
recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
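The filename convention described above can be sketched as a small classifier. The file names in the example are hypothetical, and matching case-insensitively is an assumption rather than something the record states.

```python
def transcription_style(filename):
    """Return 'QRTR', 'QTR', or None based on the marker in the file name."""
    name = filename.upper()
    # Check the longer marker first so the two conventions are never confused.
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


# Hypothetical file names, for illustration only.
examples = ["show_A.qrtr.tdf", "show_B.qtr.tdf", "readme.txt"]
styles = [transcription_style(name) for name in examples]
print(styles)
```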
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637459
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 186-068-119-332-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by the Linguistic
Data Consortium (LDC) and is comprised of approximately 129 hours of Arabic broadcast
conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and
MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation)
program. Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation
Transcripts Part 2 (LDC2016T06). Broadcast audio for the GALE program was collected
at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong
Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. LDC’s local broadcast collection system
is highly automated, easily extensible, robust and capable of collecting, processing
and evaluating hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast
satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular. All signal routing is performed under
computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high
bandwidth A/V format and are then processed to extract audio, to generate keyframes
and compressed audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition (ASR) output.
An overview of the system, the sources recorded and the configuration of the recording
laboratory are contained in the Guidelines for Broadcast Audio Collection Version
3.0 included in this release. LDC designed a portable platform for remote broadcast
collection. This is a TiVo-style digital video recording (DVR) system that records
two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL)
and FTA DVB-S satellite programming and can operate outside of the United States.
It has a small footprint, weighs less than 30 pounds and can be transported as carry-on
luggage. Medianet collected Arabic programming from across the Gulf region using its
internal system and LDC's portable broadcast collection platform installed in 2008.
The portable platform deployed at the Medianet Tunisian collection facility collected
multiple streams of regional Arabic programming from various sources. MTC collected
Arabic programming using its internal collection system. *Data* The broadcast conversation
recordings in this release feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources: Abu Dhabi TV, a
television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel,
based in Iran; Al Arabiya, a news television station based in Dubai; Al Baghdadya,
an Iraqi broadcast programmer based in Egypt; Al Fayha, an Iraqi television channel;
Al Hiwar, a regional broadcast station based in the United Kingdom; Alhurra, a U.S.
government-funded regional broadcaster; Aljazeera, a regional broadcaster located
in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Bahrain TV,
a television station in the Kingdom of Bahrain; Dubai TV, a broadcast station in the
United Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Oman TV,
a national broadcaster located in the Sultanate of Oman; Qatar TV, a broadcast programmer
in Qatar; Saudi TV, a national television station based in Saudi Arabia; Syria TV,
the national television station in Syria; and Tunisian National TV, a national television
station in Tunisia. This release contains 142 audio files presented in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file
was audited by a native Arabic speaker following Audit Procedure Specification Version
2.0 which is included in this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast collection system equipment
by identifying failed, incomplete or faulty recordings; as an indicator of broadcast
schedule changes by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about a program’s genre,
data type and topic.
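As a rough worked example, the figures stated above (16000 Hz, single channel, 16-bit PCM, approximately 129 hours, 142 files) imply the following uncompressed size and average file length. The results are estimates only, since the hour count is approximate and FLAC compression makes the distributed files smaller.

```python
SAMPLE_RATE_HZ = 16_000   # from the summary
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
TOTAL_HOURS = 129         # approximate, from the summary
FILE_COUNT = 142          # from the summary

# Bytes of raw PCM produced per second of audio.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

# Total size of the audio once decoded to uncompressed PCM.
uncompressed_bytes = bytes_per_second * TOTAL_HOURS * 3600

# Mean duration of one file, in minutes.
average_minutes = TOTAL_HOURS * 60 / FILE_COUNT

print(f"{uncompressed_bytes / 1e9:.2f} GB uncompressed")
print(f"{average_minutes:.1f} minutes per file on average")
```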
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637467
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 203-805-101-705-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
yue
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 215 hours of Cantonese conversational and scripted telephone speech
collected in 2011 along with corresponding transcripts. The Babel program focuses
on underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Cantonese speech in this release represents
that spoken in the Chinese provinces of Guangdong and Guangxi, and within those provinces,
among five dialect groups. The gender distribution among speakers is approximately
even; speakers' ages range from 16 years to 67 years. Calls were made using different
telephones (e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle. All audio data is presented
as 8 kHz 8-bit A-law encoded audio in SPHERE format. Transcripts are available in two
versions: simplified Chinese characters and a romanization scheme based on the Yale
system, both encoded in UTF-8. Further information about transcription methodology
is contained in the documentation accompanying this release. Evaluation data is available
from NIST in support of OpenKWS.
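A minimal sketch of expanding 8-bit A-law samples to 16-bit linear PCM, following the standard G.711 A-law expansion. It operates on raw sample bytes only; locating the samples inside a SPHERE file (header parsing) is outside the scope of this sketch, and the function names are illustrative.

```python
def alaw_to_linear(code):
    """Expand one 8-bit A-law code (0-255) to a signed 16-bit linear value."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        if segment > 1:
            magnitude <<= segment - 1
    # After the XOR, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload):
    """Decode a bytes object of A-law samples into a list of linear values."""
    return [alaw_to_linear(b) for b in payload]


samples = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
print(samples)
```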
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Yue Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Andrus, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hefright, Brook
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Willa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wong, Jamie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637475
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 732-088-594-631-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
DEFT Narrative Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
DEFT Narrative Text was developed by the Linguistic Data Consortium (LDC) and contains
proxy reports and their source newswire used to support DARPA's Deep Exploration and
Filtering of Text (DEFT) program. Among the goals of the DEFT program is to develop
technologies that can perform various NLP tasks on data in a variety of genres, both
formal and informal. LDC provided source data and annotations for DEFT system development.
DEFT Narrative Text consists of "proxy reports" (and "multi-proxy reports") in English.
(Multi-)proxy reports are intended to mimic the format and other features of some
types of government analyst reports using content from newswire articles. The corresponding
English newswire source documents are also included in the release. *Data* LDC staff
manually selected the source newswire from English Gigaword Fifth Edition (LDC2011T07).
Articles were selected for topics of potential interest to the intelligence community
based on general guidance from DEFT project sponsors. The newswire source documents
are XML files following the Gigaword corpus format. The proxy reports are in plain
text format.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fore, Dana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637483
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 761-085-570-786-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Arabic Web Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Arabic Web Parallel Text was developed by the Linguistic Data Consortium
(LDC). Along with other corpora, the parallel text in this release comprised training
data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and corresponding
English translations selected from weblog and newsgroup data collected by LDC and
translated by LDC or under its direction. LDC has also released the following GALE
Arabic Parallel Text data sets: * GALE Phase 1 Arabic Broadcast News Parallel Text
- Part 1 (LDC2007T24) * GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
(LDC2008T09) * GALE Phase 1 Arabic Blog Parallel Text (LDC2008T02) * GALE Phase 1
Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03) * GALE Phase 1 Arabic Newsgroup
Parallel Text - Part 2 (LDC2009T09) * GALE Phase 2 Arabic Broadcast Conversation Parallel
Text Part 1 (LDC2012T06) * GALE Phase 2 Arabic Broadcast Conversation Parallel Text
Part 2 (LDC2012T14) * GALE Phase 2 Arabic Newswire Parallel Text (LDC2012T17) * GALE
Phase 2 Arabic Broadcast News Parallel Text (LDC2012T18) * GALE Phase 2 Arabic Web
Parallel Text (LDC2013T01) * GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel
Text (LDC2015T05) * GALE Phase 3 and 4 Arabic Broadcast News Parallel Text (LDC2015T07)
* GALE Phase 3 and 4 Arabic Newswire Parallel Text (LDC2015T19) *Data* GALE Phase
3 and 4 Arabic Web Parallel Text includes 124 source-translation document pairs, comprising
61,662 tokens of Arabic source text and its English translation. Data is drawn from
various Arabic weblog and newsgroup sources. Data was manually selected for translation
according to several criteria, including linguistic features, transcription features
and topic features. The files were reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's Arabic to English
translation guidelines. Bilingual LDC staff performed quality control procedures on
the completed translations. Source data and translations are distributed in TDF format.
TDF files are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.txt. All
data are encoded in UTF-8. *Acknowledgement* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637491
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 454-658-535-275-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was developed by the
Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this
release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast conversation data collected by LDC between
2006 and 2008 and transcribed and translated by LDC or under its direction. *Data*
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text includes 63 source-translation
document pairs, comprising 487,466 tokens of Chinese source text and its English translation.
Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008 by
Anhui TV, a regional television station in Mainland China, Anhui Province; Beijing
TV, a national television station in Mainland China; China Central TV (CCTV), a national
and international broadcaster in Mainland China; Dongfang TV, a regional broadcaster
based in Shanghai, China that also covers Macau, Hong Kong and Taiwan; Hubei TV, a
regional television station in Mainland China, Hubei Province; and Phoenix TV, a Hong
Kong-based satellite television station. Broadcast conversation programming is generally
more interactive than traditional news broadcasts and includes talk shows, interviews,
call-in programs and roundtables. The files in this release were transcribed by LDC
staff and/or transcription vendors under contract to LDC in accordance with the Quick
Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries
in addition to transcribing the text. Data was manually selected for translation according
to several criteria, including linguistic features, transcription features and topic
features. The transcribed and segmented files were then reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.txt. All data are encoded in UTF-8. *Acknowledgement* This work was
supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, Chinese, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 551-020-529-029-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
cze
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
ces
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools,
system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage
Semantic Dependency Parsing (SDP) conducted in conjunction with the International
Workshop on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.
SemEval is an ongoing series of evaluations of computational semantic analysis systems
intended to explore the nature of meaning in language. It evolved from the Senseval
word sense disambiguation series to include semantic analysis tasks outside of word
sense disambiguation. *Data* SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
is based on English, Chinese and Czech data from the following resources: Treebank-2
LDC95T17, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGbank LDC2005T13
(English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and
Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0, LDC2006T01) (Czech).
The results are presented as graphs in three target representations: MRS-Derived Semantic
Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies
(PSD). As a fourth, additional target representation, CCGbank was converted to semantic
dependency graphs (in the subdirectory ‘ccd’). These graphs are aligned with the graphs
released in connection with SDP 2015 for English. The data is divided into three main
directories:
* ‘2014/’: the data, tools, and system results from Task 8 at SemEval 2014
* ‘2015/’: the data, tools, and system results from Task 18 at SemEval 2015
* ‘ccd/’: the new set of semantic dependency graphs derived from CCGbank
In the 2014 and 2015 sub-directories, the file layout preserves the original conventions
used for data distribution to SemEval participants, so as to make it easy to replicate
published results. Each sub-directory (including the new ‘ccd/’) provides its own
file ‘README.txt’ with additional instructions.
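The directory layout described above can be checked with a short Python sketch. It only encodes the three directory names and the README.txt file name given in the summary; the root folder name `sdp` and both function names are hypothetical and depend on where the release is unpacked.

```python
from pathlib import Path

# Directory names as listed in the summary above.
SDP_SUBDIRS = ("2014", "2015", "ccd")

def readme_paths(root):
    """Return the expected README.txt path for each main directory under root."""
    return [Path(root) / name / "README.txt" for name in SDP_SUBDIRS]

def missing_readmes(root):
    """Return the expected README paths that do not exist on disk."""
    return [p for p in readme_paths(root) if not p.is_file()]

# Usage: 'sdp' is a placeholder for the unpacked release folder.
expected = [p.as_posix() for p in readme_paths("sdp")]
```

Calling `missing_readmes` on the real unpack location would return an empty list if all three README files are present.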
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Chinese, and Czech. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Flickinger, Dan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hajič, Jan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ivanova, Angelina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kuhlmann, Marco
ADDED ENTRY--PERSONAL NAME
- Personal name:
Miyao, Yusuke
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oepen, Stephan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zeman, Daniel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637505
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 101-138-641-339-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by the
Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this
release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source sentences
and corresponding English translations selected from broadcast conversation data collected
by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.
*Data* GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences includes 170
source-translation document pairs, comprising 44,064 words (Arabic source) of translated
data. Data is drawn from 45 distinct Arabic broadcast conversation (BC) sources. BC
programming is more interactive than traditional broadcast news sources and may include
talk shows, interviews, call-in programs and roundtables. The data was transcribed
by LDC staff and/or transcription vendors under contract to LDC in accordance with
the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence
boundaries in addition to transcribing the text. Sentences were selected for translation
in two steps. First, files were chosen using sentence selection scripts provided by
GALE program participants SRI International and IBM. The output was then manually
reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted
into a human-readable translation format and assigned to translation vendors. Translators
followed LDC's Arabic to English translation guidelines and were provided with the
full source documents containing the target sentences for their reference. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
*Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic, Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637513
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 609-210-869-474-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
HAVIC Pilot Transcription
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
HAVIC Pilot Transcription was developed by the Linguistic Data Consortium (LDC) and
comprises approximately 72 hours of user-generated videos with transcripts based
on the English speech audio extracted from the videos. This data set was created in
collaboration with NIST (the National Institute of Standards and Technology) as part
of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal
of which is to advance multimodal event detection and related technologies. LDC has
developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been
used in the NIST-sponsored MED (Multimedia Event Detection) task for several years.
HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript
(quick and rich transcription) based on audio extracted from user-generated videos.
It contains the pilot transcripts for selected MED 2011 video files as well as the
associated videos. *Data* NIST designated the videos to be transcribed. Annotators
generated the transcripts using XTrans, which supports manual transcription across
multiple channels, languages and platforms. HAVIC transcription guidelines are included
in the documentation for this release. Each file was transcribed by a single annotator
with no corpus-wide second pass. File samples from each annotator were checked for
various errors, including missing transcription, improper mark-up, poor segmentation
and missing/added words. All transcription files are in .tdf format, a plain-text,
flat-table format with 13 tab-delimited fields. All video files are in .mp4 format
(h264), with varying bit-rates and levels of audio fidelity and video resolution.
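Because the summary states that each .tdf transcription file is a flat table with 13 tab-delimited fields, a simple sanity check is to count fields per line. The sketch below is illustrative only: the function name `bad_field_counts` is hypothetical, and skipping blank lines and lines starting with `;;` is an assumption rather than something this record specifies.

```python
EXPECTED_FIELDS = 13  # field count stated in the summary above

def bad_field_counts(lines, expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) for each data line with the wrong field count."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Assumption: blank lines and ';;' lines are not data rows.
        if not line or line.startswith(";;"):
            continue
        count = len(line.split("\t"))
        if count != expected:
            problems.append((number, count))
    return problems

# Usage with synthetic rows: one with 13 fields, one with 12.
good = "\t".join(str(i) for i in range(13))
short = "\t".join(str(i) for i in range(12))
report = bad_field_counts([good, short])
```

An empty result means every data line had the expected number of fields.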
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morris, Amanda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Antonishek, Brian
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637548
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 358-076-735-024-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast Conversation Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 172 hours of Chinese
broadcast conversation speech collected in 2008 by LDC and Hong Kong University of
Science and Technology, Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding audio data is released as GALE Phase
4 Chinese Broadcast Conversation Speech (LDC2016S03). The broadcast conversation recordings
feature interviews, call-in programs and roundtable discussions focusing principally
on current events from the following sources: Beijing TV, a national television station
in Mainland China; China Central TV (CCTV), a national and international broadcaster
in Mainland China; Hubei TV, a regional television station in Mainland China, Hubei
Province; Phoenix TV, a Hong Kong-based satellite television station; and Voice of
America (VOA), a U.S. government-funded broadcast programmer. *Data* The transcript
files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 2,259,952 tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription
tool that supports manual transcription and annotation of audio recordings. XTrans
is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
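The filename convention described above can be sketched as a small Python helper that reports which transcription style a file name indicates. The function name, the case-insensitive matching, and the example file names are assumptions for illustration; the record only says that the file name contains QTR or QRTR.

```python
def transcription_style(filename):
    """Return 'QRTR', 'QTR', or None based on the marker found in the file name."""
    name = filename.upper()  # assumption: match regardless of letter case
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None

# Usage with hypothetical file names (not taken from the corpus).
styles = [transcription_style(n) for n in ("show_a.qrtr.tdf", "show_b.qtr.tdf", "show_c.tdf")]
```

Files for which the helper returns `None` would need to be checked against the release documentation.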
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637556
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 561-327-935-781-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast Conversation Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast Conversation Speech was developed by the Linguistic
Data Consortium (LDC) and comprises approximately 172 hours of Mandarin Chinese
broadcast conversation speech collected in 2008 by LDC and Hong Kong University of
Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding transcripts are released
as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12). Broadcast
audio for the GALE program was collected at LDC’s Philadelphia, PA, USA facilities
and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic),
and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection
supported GALE at a rate of approximately 300 hours per week of programming from more
than 50 broadcast sources for a total of over 30,000 hours of collected broadcast
audio over the life of the program. LDC’s local broadcast collection system is highly
automated, easily extensible and robust, and is capable of collecting, processing and
evaluating hundreds of hours of content from several dozen sources per day. The broadcast
material is served to the system by a set of free-to-air (FTA) satellite receivers,
commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite
(DBS) receivers, and cable television (CATV) feeds. The mapping between receivers
and recorders is dynamic and modular. All signal routing is performed under computer
control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth
A/V format and are then processed to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the case of North American
English) and to generate automatic speech recognition (ASR) output. An overview of
the system, the sources recorded and the configuration of the recording laboratory
are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included
in this release. LDC designed a portable platform for remote broadcast collection.
This is a TiVo-style digital video recording (DVR) system that records two streams
of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S
satellite programming and can operate outside of the United States. It has a small
footprint, weighs less than 30 pounds and can be transported as carry-on luggage.
HKUST collected Chinese broadcast programming using its internal recording system
and a portable broadcast collection platform designed by LDC and installed at HKUST
in 2006. *Data* The broadcast conversation recordings in this release feature interviews,
call-in programs and roundtable discussions focusing principally on current events
from the following sources: Beijing TV, a national television station in Mainland
China; China Central TV (CCTV), a national and international broadcaster in Mainland
China; Hubei TV, a regional television station in Mainland China, Hubei Province;
Phoenix TV, a Hong Kong-based satellite television station; and Voice of America
(VOA), a U.S. government-funded broadcast programmer. This release contains 236 audio
files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure
Specification Version 2.0 which is included in this release. The broadcast auditing
process served three principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or faulty recordings;
as an indicator of broadcast schedule changes by identifying instances when the incorrect
program was recorded; and as a guide for data selection by retaining information about
a program’s genre, data type and topic.
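As a rough worked example of what the stated audio format implies, the Python below estimates the uncompressed PCM size of the release from the figures in the summary (approximately 172 hours, 16000 Hz, single-channel, 16-bit). The function name is hypothetical and the result is only an approximation; because FLAC is a lossless compressed format, the files as distributed should occupy less space than this figure.

```python
HOURS = 172             # approximate duration from the summary
SAMPLE_RATE_HZ = 16000  # samples per second
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # single-channel

def uncompressed_bytes(hours, rate=SAMPLE_RATE_HZ, width=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the raw PCM size in bytes for the given duration and format."""
    seconds = hours * 3600
    return seconds * rate * width * channels

total = uncompressed_bytes(HOURS)
gigabytes = total / 1e9  # decimal gigabytes, roughly 19.8
```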
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637572
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 219-696-236-485-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese Treebank 9.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese Treebank 9.0 consists of approximately two million words of annotated and
parsed text from Chinese newswire, government documents, magazine articles, various
broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion
forums, chat messages and transcribed conversational telephone speech. The Chinese
Treebank project began at the University of Pennsylvania in 1998, continued at the
University of Colorado and then moved to Brandeis University. The project's goal is
to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus.
The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated
words from Xinhua News Agency newswire. It was later corrected and released in 2001
as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words.
LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly
400,000 words, in 2004. A year later, LDC published the 500,000-word Chinese Treebank
5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of
780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated
newswire data, broadcast material and web text, for a total of approximately one million
words. Chinese Treebank 8.0 (LDC2013T21) included new annotated data from newswire,
magazine articles and government documents. Chinese Treebank 9.0 adds more annotated
web data and two new genres - chat messages and transcribed conversational telephone
speech. *Data* There are 3,726 text files in this release, containing 132,076 sentences,
2,084,387 words and 3,247,331 characters (hanzi or foreign). The data is provided in
the UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. Details
of the annotation standard can be found in the enclosed segmentation, POS-tagging
and bracketing guidelines. The data is provided in four different formats: raw text,
word-segmented, POS-tagged, and syntactically bracketed. All files were automatically
verified and manually checked.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Chinese and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zhang, Xiuhong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jiang, Zixin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chiou, Fu-Dong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chang, Meiyu
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637564
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 649-160-209-726-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CHM150
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory
of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM)
and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts,
and speaker metadata. The goal of this work was to support spoken term detection and
forensic speaker identification. *Data* This corpus comprises Mexican Spanish
microphone speech from 75 male speakers and 75 female speakers in a quiet office environment.
Speakers could answer pre-selected open questions or describe a particular painting
shown to them on a computer monitor. Speaker metadata in this release includes age,
gender, place of birth, place of residence and parents' nationalities. The audio files
are presented as 16 kHz, 16-bit PCM FLAC-compressed WAV.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hernández Mena, Carlos Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Herrera, Abel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637580
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 862-638-920-382-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Weblog Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Weblog Parallel Sentences was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and corresponding
English translations, selected from newsgroup and weblog data collected by LDC and
translated by LDC or under its direction. *Data* GALE Phase 4 Arabic Weblog Parallel
Sentences includes 1,067 source-translation document pairs, comprising 68,346 words
(Arabic source) of translated data. Data is drawn from various Arabic newsgroup and
weblog sources. Sentences were selected for translation in two steps. First, files
were chosen using sentence selection scripts provided by GALE program participants
SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate
problematic sentences. Selected files were reformatted into a human-readable translation
format and assigned to translation vendors. Translators followed LDC's Arabic to English
translation guidelines and were provided with the full source documents containing
the target sentences for their reference. Bilingual LDC staff performed quality control
procedures on the completed translations. Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one segment of text along
with meta information about that segment. Each field in the TDF file is described
in TDF_format.txt. All data are encoded in UTF-8. *Acknowledgement* This work was
supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant
No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637599
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 938-012-923-511-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Chinese Broadcast News Parallel Text was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from broadcast news data collected by LDC between 2006
and 2008 and transcribed and translated by LDC or under its direction. *Data* GALE
Phase 3 and 4 Chinese Broadcast News Parallel Text includes 76 source-translation
document pairs, comprising 614,608 tokens of Chinese source text and its English translation.
Data is drawn from 16 distinct Chinese programs broadcast between 2006 and 2008 by
China Central TV (CCTV), a national and international broadcaster in Mainland China,
and Phoenix TV, a Hong Kong-based satellite television station. The programs in this
release are news programs covering current events. The files in this release
were transcribed by LDC staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the text. Data was manually
selected for translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented files were
then reformatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
*Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637602
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 600-613-246-063-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Speed Networking Conversational Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Speed Networking Conversational Transcripts was developed at the University
of the West of England and contains 388 transcripts of English face-to-face and instant
messaging conversations about business ideas collected in 2014 and 2015 from participants
(undergraduate students) playing different power roles. This corpus was created to
examine communication accommodation, specifically, the ways in which an individual's
linguistic style, or how an individual communicates, is affected by social power and
personality. The data was collected in two studies. In the first study, 40 participants
had a series of paired five-minute face-to-face conversations, each playing a high,
low or neutral power role. The same procedure was followed in the second study except
that participants discussed business ideas via instant messaging. *Data* The face-to-face
conversations were audio-recorded and transcribed verbatim. There are 139 transcripts
of conversations between high and low power individuals and 63 transcripts of conversations
between neutral power individuals. The instant messaging program automatically saved
the transcripts of the messaging conversations; the transcripts were then retrieved
and formatted for analysis. There are 85 transcripts of conversations between high
and low power individuals and 101 transcripts of conversations between neutral power
individuals. The transcripts were anonymized. Gender and age metadata are available
where provided. All transcripts are presented as UTF-8 plain text files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Muir, Kate
ADDED ENTRY--PERSONAL NAME
- Personal name:
Joinson, Adam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cotterill, Rachel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dewdney, Nigel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637610
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 920-059-271-034-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Digital Archive of Southern Speech - NLP Version
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Digital Archive of Southern Speech - NLP Version (DASS-NLP) was developed by LDC as
an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03) suitable
for natural language processing and human language technology applications. Specifically,
the original audio files have been converted to 16 kHz, 16-bit FLAC-compressed WAV and
file names have been normalized to facilitate automatic processing. DASS was developed
by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States
(LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS-NLP contains
approximately 366 hours of English speech data from 30 female speakers and 34 male
speakers in FLAC-compressed WAV format, along with associated metadata about the speakers
and the recordings and maps in .jpeg format relating to the recording locations. LAP
consists of a set of survey research projects about the words and pronunciation of
everyday American English, the largest project of its kind in the United States. Interviews
with thousands of native speakers across the country have been carried out since 1929.
LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi,
Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted
from 1968 to 1983. Interviews average approximately six hours in length; the systematic
LAGS tape archive amounts to 5500 hours of sound recordings. DASS is a collection
of 64 interviews from LAGS selected to cover a range of speech across the region and
to represent multiple education levels and ethnic backgrounds. *Data* The DASS-NLP
speakers' average age is 61 years; there are 30 women and 34 men from the Gulf States
region represented in this release. The interviews cover common topics such as family,
the weather, household articles and activities, agriculture and social connections.
The interviews were originally recorded in the field on reel-to-reel audio tape. A
digital version of every reel of tape was then made, one .wav file per reel, usually
about one hour of sound. Each interview thus consists of a set of 3 to 13 reels, or
roughly 3 to 13 interview hours. Personally identifying or sensitive information in
the files was replaced with a tone to protect the privacy of the speakers and to ensure
their ethical treatment.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kretzschmar, William A., Jr.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bounds, Paulina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hettel, Jacqueline
ADDED ENTRY--PERSONAL NAME
- Personal name:
Coats, Steven
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pederson, Lee
ADDED ENTRY--PERSONAL NAME
- Personal name:
Opas-Hänninen, Lisa Lena
ADDED ENTRY--PERSONAL NAME
- Personal name:
Juuso, Ilkka
ADDED ENTRY--PERSONAL NAME
- Personal name:
Seppänen, Tapio
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u asm d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637629
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 108-288-201-055-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
asm
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
asm
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 205 hours of Assamese conversational and scripted telephone speech collected
in 2012 and 2013 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The speech in this release represents three
dialects spoken in Assam, a state in northeastern India. The gender distribution among
speakers is approximately even; speakers' ages range from 16 years to 66 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle. All
audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE format. Transcripts
are available in two versions: Assamese script and a romanization scheme developed
by Appen Butler Hill, both encoded in UTF-8. Further information about transcription
methodology is contained in the documentation accompanying this release. Evaluation
data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Assamese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
David, Anne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gnanadesikan, Amalia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hammond, Simon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pérez Molina, María Encarnación
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paget, Shelley
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wong, Jamie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ben d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637637
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 306-240-490-682-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 215 hours of Bengali conversational and scripted telephone speech collected
in 2011 and 2012 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Bengali speech in this release represents
that spoken in India by native speakers of Bengali born in India. The gender distribution
among speakers is approximately even; speakers' ages range from 16 years to 65 years.
Calls were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and inside
a vehicle. All audio data is presented as 8 kHz 8-bit A-law encoded audio in SPHERE
format. Transcripts are available in two versions: the Bengali script and a romanization
scheme developed by Appen Butler Hill, both encoded in UTF-8. Further information
about transcription methodology is contained in the documentation accompanying this
release. Evaluation data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Bengali. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
David, Anne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Molina, María Encarnación Pérez
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paget, Shelley
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wong, Jamie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637645
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 148-357-338-423-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast News Transcripts Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast News Transcripts Part 1 was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 132 hours of Arabic
broadcast news speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet,
Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
3 Arabic Broadcast News Speech Part 1 (LDC2016S07). The broadcast news recordings
for transcription feature news broadcasts focusing principally on current events from
the following sources: Abu Dhabi TV, a television station based in Abu Dhabi; Al Alam
News Channel, based in Iran; Al Arabiya, a news television station based in Dubai;
Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located
in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a
broadcast station in the United Arab Emirates; Kuwait TV, a national broadcast station
in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile
TV, a broadcast programmer based in Egypt; Saudi TV, a national television station
based in Saudi Arabia; and Syria TV, the national television station in Syria. *Data*
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding,
and the transcribed data totals 741,689 tokens. The transcripts were created with
the LDC tool, XTrans, which supports manual transcription and annotation of audio
recordings. XTrans is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename were developed using QRTR transcription.
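The filename convention described in this summary can be sketched in a few lines of Python. This is a minimal illustration, not part of the release: the function name transcription_style is hypothetical, the example filenames are invented, and the case-insensitive match is an assumption, since the record does not state how the marker is capitalized in actual filenames.

```python
from pathlib import PurePath


def transcription_style(filename: str) -> str:
    """Guess the transcription style from a filename marker.

    Assumption: the marker may appear in any letter case, so the
    comparison is done on an upper-cased copy of the base name.
    """
    # Look only at the final path component so directory names are ignored.
    name = PurePath(filename).name.upper()
    if "QRTR" in name:
        return "QRTR"  # quick rich transcription
    if "QTR" in name:
        return "QTR"   # quick transcription
    return "unknown"


if __name__ == "__main__":
    # Hypothetical filenames, used only to exercise the function.
    for example in ("example_program.qrtr.tdf", "example_program.qtr.tdf"):
        print(example, "->", transcription_style(example))
```

Files that carry neither marker are reported as "unknown" rather than guessed.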
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637653
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 597-417-124-701-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast News Speech Part 1
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast News Speech Part 1 was developed by the Linguistic Data
Consortium (LDC) and comprises approximately 132 hours of Arabic broadcast news
speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet, Tunis,
Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding transcripts are released as GALE Phase
3 Arabic Broadcast News Transcripts Part 1 (LDC2016T17). Broadcast audio for the GALE
program was collected at LDC’s Philadelphia, PA USA facilities and at three remote
collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese),
Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined
local and outsourced broadcast collection supported GALE at a rate of approximately
300 hours per week of programming from more than 50 broadcast sources for a total
of over 30,000 hours of collected broadcast audio over the life of the program. LDC’s
local broadcast collection system is highly automated, easily extensible and robust
and capable of collecting, processing and evaluating hundreds of hours of content
from several dozen sources per day. The broadcast material is served to the system
by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems
(DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television
(CATV) feeds. The mapping between receivers and recorders is dynamic and modular.
All signal routing is performed under computer control, using a 256x64 A/V matrix
switch. Programs are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized
closed captions (in the case of North American English) and to generate automatic
speech recognition (ASR) output. An overview of the system, the sources recorded and
the configuration of the recording laboratory are contained in the Guidelines for
Broadcast Audio Collection Version 3.0 included in this release. LDC designed a portable
platform for remote broadcast collection. This is a TiVo-style digital video recording
(DVR) system that records two streams of A/V material simultaneously. It supports
analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside
of the United States. It has a small footprint, weighs less than 30 pounds and can
be transported as carry-on luggage. Medianet collected Arabic programming from across
the Gulf region using its internal system and LDC's portable broadcast collection
platform installed in 2008. The portable platform deployed at the Medianet Tunisian
collection facility collected multiple streams of regional Arabic programming from
various sources. MTC collected Arabic programming using its internal collection system.
*Data* The broadcast news recordings in this release feature news broadcasts focusing
principally on current events from the following sources: Abu Dhabi TV, a television
station based in Abu Dhabi; Al Alam News Channel, based in Iran; Al Arabiya, a news
television station based in Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera,
a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast
station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait
TV, a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese
television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national
television station based in Saudi Arabia; and Syria TV, the national television station
in Syria. This release contains 175 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited
by a native Arabic speaker following Audit Procedure Specification Version 2.0, which
is included in this release. The broadcast auditing process served three principal
goals: as a check on the operation of the broadcast collection system equipment by
identifying failed, incomplete or faulty recordings; as an indicator of broadcast
schedule changes by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about a program’s genre,
data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637696
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 110-972-255-734-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ARL Arabic Dependency Treebank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ARL Arabic Dependency Treebank was developed by the US Army Research Laboratory (ARL)
and was derived from four LDC resources: Arabic Treebank (ATB) Part 1 v 4.1 (LDC2010T13),
Part 2 v 3.1 (LDC2011T09), Part 3 v 3.2 (LDC2010T08) and Broadcast News v 1.0 (LDC2012T07).
LDC's ATB series follows the constituency or phrase structure approach to treebank
development in which clauses are divided into noun phrases and verb phrases and in
each sentence, one or more nodes may correspond to one element. Dependency grammar,
on the other hand, is based on the idea that the verb is the center of the clause
structure and that other units in the sentence are connected to the verb as directed
links or dependencies. This is a one-to-one correspondence: for every element in the
sentence there is one node in the sentence structure that corresponds to that element.
ARL Arabic Dependency Treebank was generated using constituency-to-dependency software
written at ARL. *Data* The source data in this release consists of Arabic newswire
and broadcast programming collected by LDC from various news and broadcast providers.
The files are in an 11-column tab-separated format with one or more blank lines between
sentences. All files are UTF-8 encoded. Further information about the corpus structure
is contained in the documentation accompanying this release.
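The file layout described in this summary (11 tab-separated columns, one or more blank lines between sentences, UTF-8) can be read with a short Python sketch such as the one below. It is an illustration under stated assumptions only: the function name read_sentences is hypothetical, and because this record does not say what each column holds, the fields are returned as opaque strings.

```python
from typing import Iterable, Iterator, List

# Column count stated in the summary; the meaning of each column is
# documented with the release, not here, so fields stay uninterpreted.
EXPECTED_COLUMNS = 11


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of 11-field rows."""
    sentence: List[List[str]] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # One or more blank lines close the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {line_number}: expected {EXPECTED_COLUMNS} "
                f"columns, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence
```

A file would typically be passed in as open(path, encoding="utf-8"), where path is whichever treebank file is being read.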
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tratz, Stephen C.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637688
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 629-964-511-709-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed
by the Linguistic Data Consortium (LDC) and consists of 448,094 words of Chinese and
English parallel text enhanced with linguistic tags to indicate word relations. The
DARPA BOLT (Broad Operational Language Translation) program developed machine translation
and information retrieval for less formal genres, focusing particularly on user-generated
content. LDC supported the BOLT program by collecting informal data sources -- discussion
forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference. *Data* This release consists of Chinese source discussion
forum threads harvested from the Internet by LDC using a combination of manual and
automatic processes. The source data is released as BOLT Chinese Discussion Forums
(LDC2016T05). The BOLT word alignment task was built on treebank annotation. Specifically,
LDC automatically extracted Chinese source tokens, including empty categories/traces,
from word-segmented files provided by the BOLT Chinese Treebank annotation team at
Brandeis University. The word-segmented tokens were then used to automatically generate
ctb (Chinese Treebank) alignment and were also tokenized for character alignment by
inserting white spaces to separate characters. The data profile broken down by character
tokens, ctb tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  CTBTokens  Segments
Chinese   forum  570    448,094  672,140     442,520    20,819
*Acknowledgement*
This material is based upon work supported by the Defense Advanced Research Projects
Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily
reflect the position or the policy of the Government, and no official endorsement
should be inferred.
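The character tokenization step mentioned in this summary (inserting white spaces to separate characters) can be illustrated with a minimal Python sketch. The function name char_tokenize and the sample tokens are hypothetical, and the sketch splits every token uniformly; how the actual LDC process treated empty categories, traces, or non-Chinese strings is not stated in this record.

```python
from typing import Iterable


def char_tokenize(word_tokens: Iterable[str]) -> str:
    """Split word-segmented tokens into characters separated by spaces.

    Assumption: every token is split character by character, with no
    special handling for traces, punctuation, or Latin-script strings.
    """
    characters = [ch for token in word_tokens for ch in token]
    return " ".join(characters)


if __name__ == "__main__":
    # Invented example tokens, not taken from the corpus.
    print(char_tokenize(["我们", "喜欢", "语言"]))
```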
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Peterson, Katherine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637661
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 253-417-222-839-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Broadcast News Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Broadcast News Parallel Sentences was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source sentences
and corresponding English translations selected from broadcast news data collected
by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.
*Data* GALE Phase 4 Arabic Broadcast News Parallel Sentences includes 106 source-translation
document pairs, comprising 114,251 words (Arabic source) of translated data. Data
is drawn from 24 distinct Arabic programs featuring news broadcasts. The data was
transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance
with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated
sentence boundaries in addition to transcribing the text. Sentences were selected
for translation in two steps. First, files were chosen using sentence selection scripts
provided by GALE program participants SRI International and IBM. The output was then
manually reviewed by LDC staff to eliminate problematic sentences. Selected files
were reformatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Arabic to English translation guidelines and were
provided with the full source documents containing the target sentences for their
reference. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.txt. All
data are encoded in UTF-8. *Acknowledgement* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic, Standard Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u pus d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 594-996-615-028-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 214 hours of Pashto conversational and scripted telephone speech collected
in 2011 and 2012 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Pashto speech in this release represents
that spoken in four dialect regions of Afghanistan and Pakistan. The gender distribution
among speakers is approximately 30% female, 70% male; speakers' ages range from 17
years to 70 years. Calls were made using different telephones (e.g., mobile, landline)
from a variety of environments including the street, a home or office, a public place,
and inside a vehicle. All audio data is presented as 8kHz 8-bit a-law encoded audio
in sphere format. Transcripts are available in two versions: an extended Arabic script
and a modified Buckwalter transliteration scheme, both encoded in UTF-8. Further information
about transcription methodology is contained in the documentation accompanying this
release. Evaluation data is available from NIST in support of OpenKWS.
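The 8-bit a-law encoding named in the summary above is the ITU-T G.711 companding scheme. The following is a minimal sketch of expanding a-law bytes to 16-bit linear PCM, assuming the sample bytes have already been separated from the sphere file header (header parsing is not shown). The function names `alaw_to_linear` and `decode_alaw` are hypothetical and are not part of the release.

```python
def alaw_to_linear(byte: int) -> int:
    """Expand one G.711 a-law byte to a signed 16-bit linear PCM value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a-law sample must fit in one byte")
    value = byte ^ 0x55                 # undo the bit inversion applied by the encoder
    mantissa = (value & 0x0F) << 4      # 4-bit mantissa, moved into position
    segment = (value & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In a-law, a set top bit marks a positive sample.
    return magnitude if value & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a run of a-law bytes (assumed headerless) to linear PCM values."""
    return [alaw_to_linear(b) for b in payload]
```

Each input byte yields one sample, so one second of 8kHz audio corresponds to 8000 bytes of a-law payload.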
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Pushto. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Adams, Nikki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Khugyani, Kamila Khan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Willa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strahan, Tania E.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637718
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 722-524-976-246-2
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Richer Event Description
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Richer Event Description was developed by the University of Colorado Boulder-CLEAR
(Computational Language and Education Research), Carnegie Mellon University and LDC.
It consists of coreference, bridging and event-event relations (temporal, causal,
subevent and reporting relations) annotations over 95 English newswire, discussion
forum and narrative text documents, covering all events, times and non-eventive entities
within each document. RED annotation is intended to join different annotation layers
and to provide a rich representation of event phenomena. *Data* Documents were annotated
twice -- in a markable pass and in an event annotation phase. In the markable pass,
events, entities, TIMEX3 and section elements were annotated across the entire document
and features were marked on each markable element. Event relations and event coreference
were then annotated over the adjudicated markables. Further information about the
annotation process is contained in the guidelines accompanying this release. Annotation
and source documents are divided into three partitions: (1) 20 newswire summarization
documents, (2) 20 discussion forum documents and newswire annotations used in the
original RED pilot annotations, and (3) 55 documents annotated by a range of DEFT
(Deep Exploration and Filtering of Text) annotation formats. Data is presented as
UTF-8 encoded xml and plain text.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
O'Gorman, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637726
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 039-483-741-269-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 213 hours of Turkish conversational and scripted telephone speech collected
in 2012 along with corresponding transcripts. The Babel program focuses on underserved
languages and seeks to develop speech recognition technology that can be rapidly applied
to any human language to support keyword search performance over large amounts of
recorded speech. *Data* The Turkish speech in this release represents that spoken
in seven dialect regions in Turkey. The gender distribution among speakers is approximately
equal; speakers' ages range from 16 years to 70 years. Calls were made using different
telephones (e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle. All audio data is presented
as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are encoded in UTF-8.
Further information about transcription methodology is contained in the documentation
accompanying this release. Evaluation data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Turkish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Andresen, Jess
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Roomi, Bergul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637734
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 859-947-665-680-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
KAFD: Arabic Font Database
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals
and Qassim University. It comprises approximately 2.5 million scanned Arabic
printed pages in a variety of fonts, sizes and resolutions along with corresponding
transcripts. KAFD was designed for research in Arabic text recognition. *Data* The
scanned Arabic texts were collected from publications covering various subjects such
as religion, medicine, science and history. Texts were printed in 40 different fonts,
10 sizes and four styles. Scans were made at 100, 200, 300 and 600 dpi (dots per inch).
The database is available in two formats: at the page level and at the line level.
Images are presented as TIFF images and transcripts are in plain text format. Individual
font folders are compressed into RAR archives. The data is divided into training,
validation and test sets.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Luqman, Hamzah
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mahmoud, Sabri A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Awaida, Sameh
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u geo d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637742
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 886-007-695-912-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
geo
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kat
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 190 hours of Georgian conversational and scripted telephone speech collected
in 2014 and 2015 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Georgian speech in this release represents
that spoken in the Eastern and Western dialect regions in Georgia. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to 73 years.
Calls were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and inside
a vehicle. Audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format
or in 48kHz 24-bit PCM wav format. Transcripts are encoded in UTF-8 using a romanization
scheme developed by Appen. Further information about transcription methodology is
contained in the documentation accompanying this release. Evaluation data is available
from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Georgian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
David, Anne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hammond, Simon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gann, Ketty
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hefright, Brook
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kazi, Michael
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lam, Julie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Richardson, Fred
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walter, Marle
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u pol d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637750
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 012-353-003-824-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
ukr
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
pol
- Language code of text/sound track or separate title:
rus
- Language code of text/sound track or separate title:
ukr
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multi-Language Conversational Telephone Speech 2011 -- Slavic Group
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multi-Language Conversational Telephone Speech 2011 -- Slavic Group was developed
by the Linguistic Data Consortium (LDC) and comprises approximately 60 hours
of telephone speech across three distinct Slavic languages: Polish, Russian and
Ukrainian. The data were collected primarily to support research and technology evaluation
in automatic language identification, and portions of these telephone calls were used
in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language
pair discrimination for 24 languages/dialects, some of which could be considered mutually
intelligible or closely related. LDC has released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:
* Turkish (LDC2017S09)
* South Asian (LDC2017S14)
* Central Asian (LDC2018S03)
*Data* Participants were recruited by native
speakers who contacted acquaintances in their social network. Those native speakers
made one call, up to 15 minutes, to each acquaintance. The data was collected using
LDC's telephone collection infrastructure, comprised of three computer telephony systems.
Human auditors labeled calls for callee gender, dialect type and noise. Demographic
information about the participants was not collected. All audio data are presented
in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file
is 2 channels, recorded at 8000 samples/second with samples stored as 16-bit signed
integers, representing a lossless conversion from the original mu-law sample data
as captured digitally from the public telephone network. The following table summarizes
the total number of calls, total number of hours of recorded audio, and the total
size of compressed data:
group   lng     #calls  #hours  #MB
slavic  pol     124     28.3    1457
slavic  rus     71      13.1    577
slavic  ukr     89      19.0    932
slavic  Totals  284     60.4    2966
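The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below sums the per-language rows and compares them with the reported totals, then estimates the uncompressed PCM size per hour from the stated format (2 channels, 8000 samples/second, 16-bit samples). The names `rows`, `column_totals` and `pcm_bytes_per_hour` are illustrative only, and the estimate assumes headerless PCM with no container overhead.

```python
# Per-language rows as given in the summary: (calls, hours, compressed MB).
rows = {
    "pol": (124, 28.3, 1457),
    "rus": (71, 13.1, 577),
    "ukr": (89, 19.0, 932),
}
reported_totals = (284, 60.4, 2966)


def column_totals(table):
    """Sum each column; hours are rounded to one decimal, as in the table."""
    calls = sum(r[0] for r in table.values())
    hours = round(sum(r[1] for r in table.values()), 1)
    megabytes = sum(r[2] for r in table.values())
    return calls, hours, megabytes


def pcm_bytes_per_hour(sample_rate=8000, channels=2, bytes_per_sample=2):
    """Uncompressed size of one hour of audio in the stated format."""
    return sample_rate * channels * bytes_per_sample * 3600


print(column_totals(rows) == reported_totals)
print(pcm_bytes_per_hour())
```

The per-language rows sum to the reported totals, so the roughly 60 hours is the total across the three languages.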
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Polish, Russian, and Ukrainian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637769
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 963-197-049-281-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Chinese Newswire Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Chinese Newswire Parallel Text was developed by the Linguistic
Data Consortium (LDC). Along with other corpora, the parallel text in this release
comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and corresponding
English translations selected from newswire data collected by LDC between 2007 and
2008 and translated by LDC or under its direction. *Data* GALE Phase 3 and 4 Chinese
Newswire Parallel Text includes 367 source-translation document pairs, comprising
210,048 tokens of Chinese source text and its English translation. Data is drawn from
five distinct Chinese newswire sources. Data was manually selected for translation
according to several criteria, including linguistic features and topic features. The
files were formatted into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
*Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
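Because TDF files are described above as tab-delimited and UTF-8 encoded, a generic reader is enough for a first look at them. The following is a minimal sketch that splits each line into fields without assuming any column names; the authoritative field definitions are in TDF_format.txt. The function name `read_tab_delimited` and the sample rows are hypothetical and do not come from the release.

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks inside the text as ordinary characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields:
            yield fields


# Hypothetical sample; real column meanings are defined in TDF_format.txt.
sample = "doc1\t0.0\t3.2\tfirst segment\ndoc1\t3.2\t5.0\tsecond segment\n"
rows = list(read_tab_delimited(io.StringIO(sample)))
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.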
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Chinese, Mandarin Chinese, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637785
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T27
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 289-611-404-122-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Newswire Parallel Sentences
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T27
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Newswire Parallel Sentences was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source sentences and corresponding
English translations selected from newswire data collected by LDC in 2008 and translated
by LDC or under its direction. *Data* GALE Phase 4 Arabic Newswire Parallel Sentences
includes 393 source-translation document pairs, comprising 62,669 words (Arabic source)
of translated data. Data is drawn from six distinct Arabic newswire sources. Sentences
were selected for translation in two steps. First, files were chosen using sentence
selection scripts provided by GALE program participants SRI International and IBM.
The output was then manually reviewed by LDC staff to eliminate problematic sentences.
Selected files were reformatted into a human-readable translation format and assigned
to translation vendors. Translators followed LDC's Arabic to English translation guidelines
and were provided with the full source documents containing the target sentences for
their reference. Bilingual LDC staff performed quality control procedures on the completed
translations. Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta information
about that segment. Each field in the TDF file is described in TDF_format.txt. All
data are encoded in UTF-8. *Acknowledgement* This work was supported in part by the
Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.
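As a rough illustration of the tab-delimited layout described in the summary above, the following Python sketch reads such a file with the standard csv module. The real field names and their order are defined in TDF_format.txt and are not reproduced here; the two-field sample and the function name `read_segments` are hypothetical.

```python
import csv
import io

def read_segments(handle):
    """Return one list of fields per non-empty line of a tab-delimited file."""
    # QUOTE_NONE keeps quotation marks in the segment text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]

# Hypothetical two-field sample (file id, segment text); real TDF files carry more fields.
sample = "doc_001\tfirst segment\ndoc_001\tsecond segment\n"
segments = read_segments(io.StringIO(sample))
for fields in segments:
    print(fields)
```

For a file on disk, the same function could be passed a handle opened with encoding="utf-8" and newline="", since the summary states that all data are encoded in UTF-8.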
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic, Standard Arabic, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T27
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u tgl d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637793
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 934-396-101-948-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
tgl
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tgl
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 213 hours of Tagalog conversational and scripted telephone speech collected
in 2012 along with corresponding transcripts. The Babel program focuses on underserved
languages and seeks to develop speech recognition technology that can be rapidly applied
to any human language to support keyword search performance over large amounts of
recorded speech. *Data* The Tagalog speech in this release represents that spoken
in the North, Central and South dialect regions in the Philippines. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to 65 years.
Calls were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and inside
a vehicle. Audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE format.
Transcripts are encoded in UTF-8. Further information about transcription methodology
is contained in the documentation accompanying this release. Evaluation data is available
from NIST in support of OpenKWS.
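The audio parameters given in the summary (8 kHz sampling, 8 bits per sample) imply a fixed data rate, which the Python sketch below uses to convert between audio payload size and duration. It assumes one channel per file and ignores the SPHERE file header, neither of which is stated in this record; the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # 8 kHz, from the summary
BYTES_PER_SAMPLE = 1    # 8-bit A-law, from the summary

def audio_seconds(payload_bytes, channels=1):
    """Duration in seconds of raw 8-bit audio, ignoring any file header."""
    return payload_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)

def payload_bytes_for_hours(hours, channels=1):
    """Approximate audio payload size in bytes for a given number of hours."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)

# Single-channel assumption: roughly 213 hours of speech, as stated in the summary.
approx_bytes = payload_bytes_for_hours(213)
print(approx_bytes)
```

Under these assumptions, one second of audio occupies 8,000 bytes, so the figure printed is only an estimate of the audio payload and not the size of the distributed release.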
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Tagalog. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Willa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Molina, María Encarnación Pérez
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rafalko, Shawna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637807
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T26
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 546-386-811-027-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation
Data 2012-2014
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T26
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation
Data 2012-2014 was developed by the Linguistic Data Consortium (LDC) and contains
training and evaluation data produced in support of the TAC KBP Spanish Cross-lingual
Entity Linking tasks in 2012, 2013 and 2014. It includes queries and gold standard
entity type information, Knowledge Base links, and equivalence class clusters for
NIL entities along with the source documents for the queries, specifically, English
and Spanish newswire, discussion forum and web data. The corresponding knowledge base
is available as TAC KBP Reference Knowledge Base (LDC2014T16). Text Analysis Conference
(TAC) is a series of workshops organized by the National Institute of Standards and
Technology (NIST). TAC was developed to encourage research in natural language processing
and related applications by providing a large test collection, common evaluation procedures,
and a forum for researchers to share their results. Through its various evaluations,
the Knowledge Base Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those appearing in a knowledge
base and extract novel information about entities from a document collection and add
it to a new or existing knowledge base. Spanish cross-lingual entity linking was first
conducted as part of the 2012 TAC KBP evaluations. The track was an extension of the
monolingual English Entity Linking track (EL) whose goal was to measure systems' ability
to determine whether an entity, specified by a query, had a matching node in a reference
knowledge base (KB) and, if so, to create a link between the two. If there was no
matching node for a query entity in the KB, EL systems were required to cluster the
mention together with others referencing the same entity. More information about the
TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST
TAC website. *Data* All source documents were originally released as XML but have
been converted to text files for this release. This change was made primarily because
the documents were used as text files during data development but also because some
fail XML parsing. *Acknowledgement* This material is based on research sponsored by
Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright notation thereon.
The views and conclusions contained herein are those of the authors and should not
be interpreted as necessarily representing the official policies or endorsements,
either expressed or implied, of Air Force Research Laboratory and Defense Advanced
Research Projects Agency or the U.S. Government.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Getman, Jeremy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T26
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u bam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637815
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 830-816-122-814-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
bam
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
bam
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Bamanankan Lexicon
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Bamanankan Lexicon was developed by the Linguistic Data Consortium (LDC) and contains
5,978 entries of the Bamanankan language presented as a Bamanankan-English lexicon
and a Bamanankan-French lexicon. It is the third publication in an LDC project to
build an electronic dictionary of three Mandekan languages: Mawukakan, Maninkakan
and Bamanankan. These are Eastern Manding languages in the Mande Group of the Niger-Congo
language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005 and a Maninkakan
Lexicon (LDC2013L01) in 2013. There are approximately 15 million Bamanankan speakers
(four million in Mali and ten million in the West African region who speak Bamanankan
as a second language). The number of speakers of the different Mandekan dialects is
approximately 30 to 40 million, mainly in Mali, Burkina Faso, Senegal, Guinea Bissau,
Guinea, Liberia, Ivory Coast, Sierra Leone and Gambia. Bamanankan is the most studied
among the Mandekan languages because it is spoken as a first or second
language by at least 80% of the population of Mali and widely as a second or third
language in most of West Africa. More information about LDC's work in the languages
of West Africa and the challenges those languages present for language resource development
can be found here. *Data* This lexicon is presented using a Latin-based transcription
system because the Latin alphabet is familiar to the majority of Mandekan language
speakers and it is expected to facilitate the work of researchers interested in this
resource. The dictionary is provided in two formats, Toolbox and XML. Toolbox is a
version of the widely used SIL Shoebox program adapted to display Unicode. Toolbox
can be downloaded for free from SIL. The Toolbox files are provided in two fonts,
Arial and Doulos SIL. The Arial files should display using the Arial font which is
standard on most operating systems. Doulos SIL, available as a free download, is a
robust font that should display all characters without issue. Users should launch
Toolbox using the *.prj files in the Arial or Doulos_SIL folders. The lexicon is presented
in Unicode Normalization Form D (NFD, canonical decomposition). This means that each precomposed
character is split into a base character followed by its combining marks. See Unicode Standard Annex #15
for more information on Unicode normalization forms. The XML-formatted lexicon was generated by Toolbox
and a DTD is included. *Acknowledgement* Meghan Glenn served as an editor for the
French and English parts of this Lexicon.
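The Normalization Form D encoding mentioned in the summary can be illustrated with Python's standard unicodedata module. The character used below (e with grave accent) is only an example chosen for this sketch and is not taken from the lexicon itself.

```python
import unicodedata

def to_nfd(text):
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)

# Example character (an assumption, not drawn from the lexicon):
# U+00E8 LATIN SMALL LETTER E WITH GRAVE as a single precomposed code point.
composed = "\u00e8"
decomposed = to_nfd(composed)

# After decomposition the same letter is a base "e" plus U+0300 COMBINING GRAVE ACCENT.
print([hex(ord(ch)) for ch in decomposed])
```

One practical consequence is that a search string typed in precomposed form may not match the decomposed lexicon text byte for byte, so normalizing queries to NFD before comparison is a reasonable precaution.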
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Bambara, English, and French. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bamba, Moussa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637823
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 484-489-597-064-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MWE-Aware English Dependency Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MWE-Aware English Dependency Corpus was developed by the Nara Institute of Science
and Technology Computational Linguistics Laboratory and consists of English compound
function words annotated in dependency format. The data is derived from OntoNotes
Release 5.0 (LDC2013T19). Compound function words are a type of multiword expression
(MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic
unit. Doing so facilitates natural language processing tasks such as constituency
and dependency parsing. Version 2.0 is available from LDC as MWE-Aware English Dependency
Corpus 2.0 (LDC2017T16). *Data* MWE-Aware English Dependency Corpus was derived from
the Wall Street Journal portion of OntoNotes Release 5.0. MWEs were identified in
OntoNotes' phrase structure trees and each MWE was established as a single subtree.
Those phrase structure subtrees were then converted to a dependency structure (the
Stanford dependencies) in CoNLL format. The data is split into 1,728 phrase structure
trees as *.parse files and 14-column, tab-separated dependency data in a single *.conll
file. Both file types are encoded as UTF-8.
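A tab-separated CoNLL-style file like the one described in the summary can be read with a few lines of Python. The sketch below assumes the common CoNLL convention of one token per line with blank lines between sentences; the meaning of the 14 columns is not given in this record, so the sample rows use placeholder values, and the helper names and the phrase "in spite of" are hypothetical.

```python
import io

def read_conll(handle):
    """Group tab-separated token rows into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences

def make_row(index, form, n_columns=14):
    # Placeholder row: token index, word form, and "_" for the remaining columns.
    return "\t".join([str(index), form] + ["_"] * (n_columns - 2))

sample = "\n".join(
    [make_row(1, "in"), make_row(2, "spite"), make_row(3, "of"), "", make_row(1, "prices")]
) + "\n"
sentences = read_conll(io.StringIO(sample))
print(len(sentences), [len(s) for s in sentences])
```

Keeping every column as a plain string avoids guessing at column semantics; a caller can index into each row once the accompanying documentation has been consulted.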
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kato, Akihiko
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shindo, Hiroyuki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matsumoto, Yuji
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637831
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017L01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 445-866-322-325-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Arabic Speech Recognition Pronunciation Dictionary
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017L01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar Computing
Research Institute. It contains approximately two million pronunciation entries for
526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each
grapheme word. *Data* The dictionary was developed from news archive resources, including
the Arabic news website Aljazeera.net. The selected words were those that occurred
more than once in the news collection. The text was processed using MADA. The dictionary
is presented in a single UTF-8 plain text file.
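The average quoted in the summary is the number of pronunciation entries divided by the number of distinct words; with the rounded figures given, 2,000,000 / 526,000 is about 3.80, which is consistent with the reported 3.84 if the entry count is slightly above two million. The Python sketch below computes the same statistic from a dictionary file. The line layout it assumes (a word followed by whitespace-separated phone symbols) is hypothetical, since the record does not describe the file format.

```python
from collections import defaultdict
import io

def average_pronunciations(handle):
    """Average number of distinct pronunciations per word in a dictionary file."""
    prons = defaultdict(set)
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Assumed layout: first token is the word, the rest is one pronunciation.
        word, phones = parts[0], tuple(parts[1:])
        prons[word].add(phones)
    total = sum(len(variants) for variants in prons.values())
    return total / len(prons) if prons else 0.0

# Hypothetical sample: three pronunciations for w1 and one for w2.
sample = "w1 a b\nw1 a c\nw1 a d\nw2 e f\n"
avg = average_pronunciations(io.StringIO(sample))
print(avg)
```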
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ali, Ahmed
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017L01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u vie d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 401-277-958-467-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
vie
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 201 hours of Vietnamese conversational and scripted telephone speech
collected in 2012 along with corresponding transcripts. The Babel program focuses
on underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Vietnamese speech in this release represents
that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam.
The gender distribution among speakers is approximately equal; speakers' ages range
from 16 years to 64 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or office, a
public place, and inside a vehicle. Audio data is presented as 8 kHz, 8-bit A-law encoded
audio in SPHERE format. Transcripts are encoded in UTF-8. Further information about
transcription methodology is contained in the documentation accompanying this release.
Evaluation data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Vietnamese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Andrus, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Corris, Miriam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hefright, Brook
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637858
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 660-719-239-718-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 and 4 Chinese Web Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 and 4 Chinese Web Parallel Text was developed by the Linguistic Data
Consortium (LDC). Along with other corpora, the parallel text in this release comprised
training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Chinese source text and corresponding English translations
selected from weblog and newsgroup data collected by LDC and translated by LDC or
under its direction. *Data* GALE Phase 3 and 4 Chinese Web Parallel Text includes
88 source-translation document pairs, comprising 67,514 tokens of Chinese source text
and its English translation. Data was manually selected for translation according
to several criteria, including linguistic features and topic features. The files were
formatted into a human-readable translation format and assigned to translation vendors.
Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC
staff performed quality control procedures on the completed translations. Source data
and translations are distributed in TDF format. TDF files are tab-delimited files
containing one segment of text along with meta information about that segment. Each
field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
*Acknowledgement* This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication
does not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.
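As a rough illustration of the tab-delimited layout mentioned in the summary above, the following minimal sketch splits each line into fields. It does not use the real field names, which are documented in TDF_format.txt and are not reproduced in this record. The function name `read_tab_delimited`, the `;;` comment prefix and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def read_tab_delimited(stream, comment_prefix=";;"):
    """Return a list of rows, each a list of tab-separated fields.

    The comment prefix is an assumption for illustration; consult
    TDF_format.txt in the release for the actual conventions.
    """
    rows = []
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and lines that look like comments.
        if not fields or fields[0].startswith(comment_prefix):
            continue
        rows.append(fields)
    return rows


# Hypothetical sample: one segment of text per line plus some metadata.
sample = "doc1\t0\tspeakerA\t你好\n;; a comment line\ndoc1\t1\tspeakerB\tHello\n"
rows = read_tab_delimited(io.StringIO(sample))
```

For a real file, opening it with `open(path, encoding="utf-8", newline="")` and passing the handle to the function would match the UTF-8 encoding stated in the summary.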
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Chinese, Mandarin Chinese, and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Krug, Gary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637866
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 141-827-463-794-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
First-Year Law Students' Court Memoranda
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
First-Year Law Students' Court Memoranda consists of 197 English law student writing
samples of legal briefs annotated for certain characteristics along with accompanying
survey responses by the student writers. The briefs were created in a law school writing
class at two law schools in the US Midwest during the 2011-12 academic year. Students
who agreed to participate in this study uploaded their briefs to an online survey
instrument and answered questions regarding their age, gender, level of education,
most recent writing course and method of learning English. The study's purpose was
to apply natural language processing approaches to determine any differences in the
briefs' language attributable to the students' self-reported genders. *Data* The writings
are the year-end memoranda of law to a court required in the two legal writing classes.
All students were writing in the same genre and in many instances, on the same hypothetical
legal case. The samples were imported into the General Architecture for Text Engineering
(GATE) and annotated by two human coders who identified large text segments specific
to the legal genre in which the students wrote, such as text headings, citations,
block quotes and footnotes. Writing samples are presented as MS Word documents and
annotations and survey responses are presented in XML format. The data has been anonymized
to remove names and other identifying information about the student participants.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Larson, Brian N.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 280-113-850-942-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Chinese-English Parallel Sentences Extracted from Patents
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Chinese-English Parallel Sentences Extracted from Patents was developed by Chilin
(HK) Limited and contains 500,000 sentence pairs of Chinese-English parallel text.
This resource is based on the training corpus and test sets developed for the Tokyo-based
NTCIR 2009 & 2010 tasks on Patent Machine Translation. *Data* The sentences in this
release were selected from a larger corpus of more than 300,000 Chinese-English parallel
patents in different fields according to a number of filtering parameters including
word alignment, sentence length and language modeling. They were then automatically
segmented and aligned. All text is encoded as UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tsou, Benjamin
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chow, Kapo
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2016 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637777
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2016T24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 498-037-802-860-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
JANA: A Human-Human Dialogues Corpus for Egyptian Dialect
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2016]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2016T24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
JANA: A Human-Human Dialogues Corpus for Egyptian Dialect was developed by researchers
at Cairo University. It consists of 82 transcribed dialogues from call center inquiries
annotated for dialogue acts. Data was collected from call centers for banks, airlines
and mobile network providers as follows: (1) spontaneous spoken dialogues from inquiries
to banks and airlines; and (2) instant messaging (chat) dialogues from a mobile network
provider's online support system. *Data* The transcribed dialogues consist of 52 telephone
calls and 30 instant messaging conversations, amounting to approximately 20,311 words.
The data contains roughly 3,001 conversation turns, with an average of 6.7 words per
turn, and 4,725 utterances, with an average of 4.3 words per utterance. The data was
transcribed using Transcriber. All data is presented as UTF-8 XML.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Elmadany, AbdelRahim A.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Abdou, Sherif M.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gheith, Mervat
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2016T24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637890
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 141-827-463-794-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast News Speech Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by the Linguistic Data
Consortium (LDC) and is comprised of approximately 128 hours of Arabic broadcast news
speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet, Tunis,
Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding transcripts are released as GALE Phase
3 Arabic Broadcast News Transcripts Part 2 (LDC2017T04). Broadcast audio for the GALE
program was collected at LDC’s Philadelphia, PA, USA facilities and at three remote
collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese),
Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined
local and outsourced broadcast collection supported GALE at a rate of approximately
300 hours per week of programming from more than 50 broadcast sources for a total
of over 30,000 hours of collected broadcast audio over the life of the program. LDC’s
local broadcast collection system is highly automated, easily extensible and robust
and capable of collecting, processing and evaluating hundreds of hours of content
from several dozen sources per day. The broadcast material is served to the system
by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems
(DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television
(CATV) feeds. The mapping between receivers and recorders is dynamic and modular.
All signal routing is performed under computer control, using a 256x64 A/V matrix
switch. Programs are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized
closed captions (in the case of North American English) and to generate automatic
speech recognition (ASR) output. An overview of the system, the sources recorded and
the configuration of the recording laboratory are contained in the Guidelines for
Broadcast Audio Collection Version 3.0 included in this release. LDC designed a portable
platform for remote broadcast collection. This is a TiVo-style digital video recording
(DVR) system that records two streams of A/V material simultaneously. It supports
analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside
of the United States. It has a small footprint, weighs less than 30 pounds and can
be transported as carry-on luggage. Medianet collected Arabic programming from across
the Gulf region using its internal system and LDC's portable broadcast collection
platform installed in 2008. The portable platform deployed at the Medianet Tunisian
collection facility collected multiple streams of regional Arabic programming from
various sources. MTC collected Arabic programming using its internal collection system.
*Data* The recordings in this release feature news broadcasts focusing principally
on current events from the following sources: Abu Dhabi TV, United Arab Emirates;
Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in
Dubai; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster
located in Doha, Qatar; Al-Manar TV, a broadcast programmer located in Lebanon; Al
Ordiniyah, a national broadcast station in Jordan; Al Sharqiya, an Iraqi television
station; Dubai TV, a broadcast station in the United Arab Emirates; Kuwait TV, a national
broadcast station in Kuwait; Nile TV, a broadcast programmer based in Egypt; Oman
TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national
television station based in Saudi Arabia; and Syria TV, the national television station
in Syria. This release contains 175 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited
by a native Arabic speaker following Audit Procedure Specification Version 2.0 which
is included in this release. The broadcast auditing process served three principal
goals: as a check on the operation of the broadcast collection system equipment by
identifying failed, incomplete or faulty recordings; as an indicator of broadcast
schedule changes by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about a program’s genre,
data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637882
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 539-362-793-352-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 3 Arabic Broadcast News Transcripts Part 2
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 128 hours of Arabic
broadcast news speech collected in 2007 by the Linguistic Data Consortium (LDC), MediaNet,
Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
3 Arabic Broadcast News Speech Part 2 (LDC2017S02). The recordings for transcription
feature news broadcasts focusing primarily on current events from the following sources:
Abu Dhabi TV, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya,
a news television station based in Dubai; Al Iraqiyah, an Iraqi television station;
Aljazeera, a regional broadcaster located in Doha, Qatar; Al-Manar TV, a broadcast
programmer located in Lebanon; Al Ordiniyah, a national broadcast station in Jordan;
Al Sharqiya, an Iraqi television station; Dubai TV, a broadcast station in the United
Arab Emirates; Kuwait TV, a national broadcast station in Kuwait; Nile TV, a broadcast
programmer based in Egypt; Oman TV, a national broadcaster located in the Sultanate
of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria
TV, the national television station in Syria. *Data* The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 721,846
tokens. The transcripts were created with the LDC tool, XTrans, which supports manual
transcription and annotation of audio recordings. XTrans is available from the following
link, https://www.ldc.upenn.edu/language-resources/tools/xtrans. The files in this
corpus were transcribed by LDC staff and/or by transcription vendors under contract
to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick
rich transcription specification (QRTR), both of which are included in the documentation
with this release. QTR transcription consists of quick (near-) verbatim, time-aligned
transcripts plus speaker identification with minimal additional mark-up. It does not
include sentence unit annotation. QRTR annotation adds structural information such
as topic boundaries and manual sentence unit annotation to the core components of
a quick transcript. Files with QTR as part of the filename were developed using QTR
transcription. Files with QRTR in the filename indicate QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u hat d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637874
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 763-119-338-310-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
hat
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
hat
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen
for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains approximately 203 hours of Haitian Creole conversational and scripted telephone
speech collected in 2012 and 2013 along with corresponding transcripts. The Babel
program focuses on underserved languages and seeks to develop speech recognition technology
that can be rapidly applied to any human language to support keyword search performance
over large amounts of recorded speech. *Data* The Haitian Creole speech in this release
represents that spoken in the Northern, Western and Southern dialect regions in Haiti.
The gender distribution among speakers is approximately equal; speakers' ages range
from 16 years to 75 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or office, a
public place, and inside a vehicle. Audio data is presented as 8kHz 8-bit a-law encoded
audio in sphere format or 48kHz 24-bit PCM encoded audio in wav format. Transcripts
are encoded in UTF-8. Further information about transcription methodology is contained
in the documentation accompanying this release. Evaluation data is available from
NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Haitian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Andrus, Tony
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Crabb, Erin Smith
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gillies, Breanna
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hazen, T.J.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hefright, Brook
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jarrett, Amy
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637939
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 107-834-092-668-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Noisy TIMIT Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains
approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech
Corpus (LDC93S1) modified with different additive noise levels. Only the audio has
been modified; the original arrangement of the TIMIT corpus is still as described
by the TIMIT documentation. *Data* The additive noise types are white, pink, blue, red,
violet and babble noise, with noise levels ranging from 5 to 50 dB (decibels) in 5 dB
steps. The color of noise refers to the power spectrum of a noise signal.
Sound waves have two characteristics: frequency, which describes how fast the waveform
vibrates per second; and amplitude, the size of the waveform. Colored noises are named
in an analogy to the colors of light. For instance, white noise contains all audible
frequencies just as white light contains all frequencies in the visible range. Non-white
colored noises have more energy concentrated at the high or low end of the sound spectrum.
White, pink and blue noise are officially defined in the federal telecommunications
standard. The white, pink, blue, red and violet noise types added to the TIMIT data
in this release were generated artificially using MATLAB. For the babble noise, a
random segment of recorded babble speech was selected and scaled relative to the power
of the original TIMIT audio signal. All audio files are presented as single-channel
16kHz 16-bit FLAC.
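The scaling of noise relative to signal power described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the corpus (the summary states that the colored noise was generated in MATLAB). It assumes the stated noise levels are signal-to-noise ratios in dB, which the record does not say explicitly, and the function names `mean_power`, `add_noise_at_snr` and `white_noise` are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal: the mean of its squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` to the requested SNR (assumed meaning of "noise level")
    relative to `signal`, then add it. Returns (mixed, scaled_noise)."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[:len(signal)]
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise segment is silent")
    # SNR_dB = 10 * log10(P_signal / P_noise), so the target noise power is:
    target_power = p_signal / (10 ** (snr_db / 10))
    gain = math.sqrt(target_power / p_noise)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled)]
    return mixed, scaled


def white_noise(n, seed=0):
    """Gaussian white noise: roughly equal power at all frequencies."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Example: a 440 Hz tone at 16 kHz, mixed at each level from 5 to 50 dB.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
levels_db = list(range(5, 55, 5))
mixes = {db: add_noise_at_snr(tone, white_noise(16000), db)[0] for db in levels_db}
```

The same scaling step would apply to a recorded babble segment in place of `white_noise`; generating pink, blue, red or violet noise would additionally require shaping the spectrum, which is not shown here.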
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Abdulaziz, Azhar
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kepuska, Veton
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u swa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637904
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 874-256-867-958-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
swa
- Language code of text/sound track or separate title:
swa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
swh
- Language code of text/sound track or separate title:
swa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 200 hours of Swahili conversational and scripted telephone speech collected
from 2012 to 2014, along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Swahili speech in this release represents
that spoken in the Nairobi dialect region of Kenya. The gender distribution among
speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle. Audio
data is presented as 8kHz 8-bit a-law encoded audio in sphere format or 48kHz 24-bit
PCM encoded audio in wav format. Transcripts are encoded in UTF-8. Further information
about transcription methodology is contained in the documentation accompanying this
release. Evaluation data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Swahili (individual language) and Swahili (macrolanguage). Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Andresen, Jess
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kozlov, Kirill
ADDED ENTRY--PERSONAL NAME
- Personal name:
Malyska, Nicolas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morrison, Michelle
ADDED ENTRY--PERSONAL NAME
- Personal name:
Phillips, Josh
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wong, Jamie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637912
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 842-120-979-982-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT Chinese Discussion Forum Parallel Training Data
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT Chinese Discussion Forum Parallel Training Data was developed by the Linguistic
Data Consortium (LDC) and consists of 1,876,799 tokens of Chinese discussion forum
data collected for the DARPA BOLT program along with their corresponding English translations.
The BOLT (Broad Operational Language Translation) program developed machine translation
and information retrieval for less formal genres, focusing particularly on user-generated
content. LDC supported the BOLT program by collecting informal data sources -- discussion
forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference. *Data* The source data in this release consists of discussion
forum threads harvested from the Internet by LDC using a combination of manual and
automatic processes. The full source data collection is released as BOLT Chinese Discussion
Forums (LDC2016T05). Word-aligned and tagged data is released as BOLT Chinese-English
Word Alignment and Tagging - Discussion Forum Training (LDC2016T19). Data was manually
selected for translation according to several criteria, including linguistic features
and topic features. The files were then segmented into sentence units, formatted into
a human-readable translation format and assigned to translation vendors. Translators
followed LDC's BOLT translation guidelines. Bilingual LDC staff performed quality
control procedures on the completed translations. All data are presented as UTF-8.
The following table shows the data volume of this package:
Source Lang   Genre              Files   Source Tokens   Target Tokens
Chinese       Discussion Forum   1,541   1,876,779       1,557,873
*Acknowledgement*
This material is based upon work supported by the Defense Advanced Research Projects
Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily
reflect the position or the policy of the Government, and no official endorsement
should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese, English, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garland, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637920
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 616-843-327-054-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE English-Chinese Parallel Aligned Treebank -- Training
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE English-Chinese Parallel Aligned Treebank -- Training was developed by the Linguistic
Data Consortium (LDC) and contains 196,123 tokens of word aligned English and Chinese
parallel text with treebank annotations. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program. Parallel aligned
treebanks are treebanks annotated with morphological and syntactic structures aligned
at the sentence level and the sub-sentence level. Such data sets are useful for natural
language processing and related fields, including automatic word alignment system
training and evaluation, transfer-rule extraction, word sense disambiguation, translation
lexicon extraction and cultural heritage and cross-linguistic studies. With respect
to machine translation system development, parallel aligned treebanks may improve
system performance with enhanced syntactic parsers, better rules and knowledge about
language pairs and reduced word error rate. The English source data was translated
into Chinese. Chinese and English treebank annotations were performed independently.
The parallel texts were then word aligned. The material in this release corresponds
to portions of the treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0
(LDC2011T03). *Data* This release consists of English source broadcast programming
(CNN, NBC/MSNBC) and web data collected by LDC in 2005 and 2006. The distribution
by genre, words, character tokens, treebank tokens and segments appears below:
Genre   Files   Words     CharTokens   CTBTokens   Segments
bc      6       60,061    90,092       62,438      3,763
wb      15      70,687    106,031      69,309      3,238
Total   21      130,748   196,123      131,747     7,001
Note that all token counts are based on the Chinese data only. One token is equivalent
to one character and one word is equivalent to 1.5 characters. The word alignment task
consisted of the following components:
* Identifying, aligning, and tagging eight different types of links
* Identifying, attaching, and tagging local-level unmatched words
* Identifying and tagging sentence/discourse-level unmatched words
* Identifying and tagging all instances of Chinese 的 (DE) except when they were a part
  of a semantic link
This release contains
nine types of files - English raw source files, Chinese raw translation files, Chinese
character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese
treebank files, English treebank files, character-based word alignment files, and
CTB-based word alignment files.
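The genre table in this summary can be checked for internal consistency with a short script. This is only a sketch: the figures are copied from the summary, the bc word count is assumed to be 60,061 (the only value consistent with the stated total of 130,748), and the names `rows`, `total`, `column_sums` and `chars_per_word` are illustrative.

```python
# Figures copied from the summary's table. The bc word count is assumed to be
# 60,061, the only value consistent with the stated total of 130,748.
COLUMNS = ("files", "words", "char_tokens", "ctb_tokens", "segments")
rows = {
    "bc": (6, 60061, 90092, 62438, 3763),
    "wb": (15, 70687, 106031, 69309, 3238),
}
total = (21, 130748, 196123, 131747, 7001)


def column_sums(table):
    """Sum each column across all genre rows."""
    return tuple(sum(col) for col in zip(*table.values()))


def chars_per_word(row):
    """Ratio of character tokens to words for one row."""
    values = dict(zip(COLUMNS, row))
    return values["char_tokens"] / values["words"]


# True when every column of the genre rows adds up to the Total row.
sums_match = column_sums(rows) == total

# Characters-per-word ratio for each row, to compare with the stated 1.5.
ratios = {genre: chars_per_word(row) for genre, row in rows.items()}
ratios["Total"] = chars_per_word(total)
```

With these figures every column sums to the Total row, and each ratio is within rounding of the 1.5 characters per word stated in the summary.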
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Mandarin Chinese, and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Li, Xuansong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Grimes, Stephen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcus, Mitch
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taylor, Ann
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u lao d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637947
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 727-478-824-788-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
lao
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
lao
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 207 hours of Lao conversational and scripted telephone speech collected
in 2013 along with corresponding transcripts. The Babel program focuses on underserved
languages and seeks to develop speech recognition technology that can be rapidly applied
to any human language to support keyword search performance over large amounts of
recorded speech. *Data* The Lao speech in this release represents that spoken in the
Vientiane dialect region in Laos. The gender distribution among speakers is approximately
equal; speakers' ages range from 16 years to 60 years. Calls were made using different
telephones (e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle. Audio data is presented as
8kHz 8-bit a-law encoded audio in sphere format and 48kHz 24-bit PCM encoded audio
in wav format. Transcripts are encoded in UTF-8. The romanization scheme was developed
by Appen and was based on the scheme developed by the American Library Association
and Library of Congress. Further information about transcription methodology is contained
in the documentation accompanying this release. Evaluation data is available from
NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Lao. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Benowitz, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Heighway, Melanie
ADDED ENTRY--PERSONAL NAME
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Onaka, Akiko
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637955
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 429-091-121-265-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2010 NIST Speaker Recognition Evaluation Test Set
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic
Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It
contains 2,255 hours of American English telephone speech and speech recorded over
a microphone channel involving an interview scenario used as test data in the NIST-sponsored
2010 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations
conducted by NIST is intended to be of interest to researchers working on the general
problem of text-independent speaker recognition. To this end, the evaluations are designed
to be simple, to focus on core technology issues, to be fully supported and to be
accessible to those wishing to participate. Like the 2008 evaluation, the 2010 evaluation
included in the training and test conditions for the core test
not only conversational telephone speech (CTS) recorded over ordinary telephone channels,
but also CTS and conversational interview speech recorded over a room microphone channel.
Unlike prior evaluations, some of the conversational telephone style speech was collected
in a manner to produce particularly high, or particularly low, vocal effort on the
part of the speaker of interest. *Data* The speech recordings in this release were
collected in 2009 and 2010 by LDC at its Human Subjects Collection facility in Philadelphia.
This collection was part of the Mixer 6 project, which was designed to support the
development of robust speaker recognition technology by providing carefully collected
and audited speech from a large pool of speakers recorded simultaneously across numerous
microphones. The telephone speech segments include two-channel excerpts of approximately
10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5
minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations,
intervals of silence were not removed. The data included in this release is 8-bit
u-law with a sample rate of 8000 Hz. In addition to evaluation data, this package also
includes answer keys, trial and train files, development data and evaluation documentation.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Greenberg, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martin, Alvin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Brandschain, Linda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637963
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 134-467-387-379-1
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CHiME2 Grid
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition
Challenge and contains approximately 120 hours of English speech from a noisy living
room environment. The CHiME Challenges focus on distant-microphone automatic speech
recognition (ASR) in real-world environments. CHiME2 Grid reflects the small vocabulary
track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus
and consist of 34 speakers reading simple 6-word sequences. LDC also released CHiME2
WSJ0 (LDC2017S10) and CHiME3 (LDC2017S24). *Data* Data is divided into training, development
and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The noisy
utterances are provided both in isolated form and in embedded form. The embedded utterances
either include five seconds of background noise before and after the utterance (in the training
set) or are mixed into continuous five-minute background noise recordings (in the
development and test sets). Seven hours of background noise not part of the training
set are also included. The data is accompanied by one annotation file per speaker
that includes additional technical information. Also included is a baseline Hidden
Markov Model (HMM)-based speech recogniser and a scoring tool designed for the 2nd
CHiME Challenge to allow users to obtain keyword recognition scores from formatted
result files, perform recognition and score the challenge data, and estimate parameters
of speaker dependent HMMs.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vincent, Emmanuel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Barker, Jon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Watanabe, Shinji
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le Roux, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nesta, Francesco
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matassoni, Marco
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637971
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 573-046-449-233-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT Egyptian Arabic SMS/Chat and Transliteration
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by the Linguistic
Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS)
and Chat (CHT) data collected through data donations and live collection involving
native speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling
1,029,248 words across 262,026 messages. Messages were natively written in either
Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi conversations (287,022
words) were transliterated from the original romanized Arabizi script into standard
Arabic orthography. The BOLT (Broad Operational Language Translation) program developed
machine translation and information retrieval for less formal genres, focusing particularly
on user-generated content. LDC supported the BOLT program by collecting informal data
sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic
and English. The collected data was translated and annotated for various tasks including
word alignment, treebanking, propbanking and co-reference. *Data* The data in this
release was collected using two methods: new collection via LDC's collection platform,
and donation of SMS or chat archives from BOLT collection participants. All data collected
were reviewed manually to exclude any messages/conversations that were not in the
target language or that had sensitive content, such as personal identifying information
(PII). A portion of the source conversations containing Arabizi tokens were automatically
transliterated into Arabic script. Once the Arabizi source was transliterated into
Arabic script automatically, LDC annotators reviewed, corrected and normalized the
transliteration according to "Conventional Orthography for Dialectal Arabic" (CODA).
All data is presented in XML. *Acknowledgement* This material is based upon work supported
by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic and Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fore, Dana
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 052-688-100-874-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Phrase Detectives Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Phrase Detectives Corpus was developed by the School of Computer Science and Electronic
Engineering at the University of Essex and consists of approximately 19,012 words
across 40 documents anaphorically-annotated by the Phrase Detectives game, an online
interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric
coreference. GWAPs for creating language resources are growing. In general, they employ
non-monetary incentives, such as entertainment, to motivate participation and can
be successful for large-scale persistent annotation efforts. *Data* The documents
in the corpus are taken from Wikipedia articles and from narrative text in Project
Gutenberg. Wikipedia articles and annotation files are presented as XML and Project
Gutenberg source files are presented as plain text. All text is encoded as UTF-8.
Annotations comprise a gold standard version created by multiple experts,
as well as a set created by a large non-expert crowd (via the Phrase Detectives game).
The data was annotated according to a prevalent linguistically-oriented approach for
anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010
Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01)
and The ARRAU Corpus of Anaphoric Information (LDC2013T22).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chamberlain, Jon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Poesio, Massimo
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kruschwitz, Udo
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638005
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 173-931-115-382-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
The EventStatus Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
The EventStatus Corpus was developed by researchers at Texas A&M University, Stanford
University and The University of Utah. It consists of approximately 3,000 English
and 1,500 Spanish news articles about civil unrest events annotated with temporal
tags. This corpus was designed to support the study of the temporal and aspectual
properties of major events, that is, whether an event has already happened, is currently
happening or may happen in the future. Since it focuses on a single domain (civil
unrest events), it may be appropriate for tasks such as event extraction and temporal
question answering. *Data* The relevant news articles were sourced from English Gigaword
Fifth Edition (LDC2011T07) and Spanish Gigaword Third Edition (LDC2011T12). The civil
unrest events include protests, demonstrations, marches and strikes. The data was
annotated as PAST, ON-GOING or FUTURE and, within the FUTURE category, as PLANNED,
ALERT or POSSIBLE. In addition to the annotated articles, file lists used in experiments
for tuning and test are included. 10-fold cross-validations were performed, and the
specific 10-fold splits of the test are included as well. All text is presented as
plain text and encoded in UTF-8.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Huang, Ruihong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jurafsky, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Riloff, Ellen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u tur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585637998
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 466-022-433-410-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multi-Language Conversational Telephone Speech 2011 -- Turkish
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by the
Linguistic Data Consortium (LDC) and is comprised of approximately 18 hours of telephone
speech in Turkish. The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these telephone calls
were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused
on language pair discrimination for 24 languages/dialects, some of which could be
considered mutually intelligible or closely related. LDC has released the following
as part of the Multi-Language Conversational Telephone Speech 2011 series: * Slavic
Group (LDC2016S11) * South Asian (LDC2017S14) * Central Asian (LDC2018S03) *Data*
Participants were recruited by native speakers who contacted acquaintances in their
social network. Those native speakers made one call, up to 15 minutes, to each acquaintance.
The data was collected using LDC's telephone collection infrastructure, comprised
of three computer telephony systems. Human auditors labeled calls for callee gender,
dialect type and noise. Demographic information about the participants was not collected.
All audio data are presented in FLAC-compressed MS-WAV (RIFF) file format (*.flac);
when uncompressed, each file is 2 channels, recorded at 8000 samples/second with samples
stored as 16-bit signed integers, representing a lossless conversion from the original
mu-law sample data as captured digitally from the public telephone network. The following
table summarizes the total number of calls, total number of hours of recorded audio,
and the total size of compressed data:
group    lng  #calls  #hours  #MB
turkish  tur  87      18.6    975
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Turkish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017V01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 810-731-329-467-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
UCLA High-Speed Laryngeal Video and Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
two-dimensional moving image
- Content type code:
tdi
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017V01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
UCLA High-Speed Laryngeal Video and Audio was developed by UCLA Speech Processing
and Auditory Perception Laboratory and is comprised of high-speed laryngeal video
recordings of the vocal folds and synchronized audio recordings from nine subjects
collected between April 2012 and April 2013. Speakers were asked to sustain the vowel
/i/ for approximately ten seconds while holding voice quality, fundamental frequency,
and loudness as steady as possible. In the field of speech production theory, data
such as that contained in this release may be used to study the relationship between vocal
fold vibration and the resulting voice quality. *Data* None of the subjects had a history
of a voice disorder. There was no native language requirement for recruiting subjects;
participants were native speakers of various languages, including English, Mandarin
Chinese, Taiwanese Mandarin, Cantonese and German. Audio data is presented as 16kHz
16-bit flac and video is in avi format at 5 fps (frames per second).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Video recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chen, Gang
ADDED ENTRY--PERSONAL NAME
- Personal name:
Neubauer, Juergen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garellek, Marc
ADDED ENTRY--PERSONAL NAME
- Personal name:
Samlan, Robin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gerratt, Bruce R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kreiman, Jody
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alwan, Abeer
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017V01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638013
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 071-714-384-459-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CHiME2 WSJ0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition
Challenge and contains approximately 166 hours of English speech from a noisy living
room environment. The CHiME Challenges focus on distant-microphone automatic speech
recognition (ASR) in real-world environments. CHiME2 WSJ0 reflects the medium vocabulary
track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0)
Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall
Street Journal news text. LDC also released CHiME2 Grid (LDC2017S07) and CHiME3 (LDC2017S24).
*Data* Data is divided into training, development and test sets. All data is provided
as 16 bit WAV files sampled at 16 kHz. The noisy utterances are in isolated form and
in embedded form. The latter involves five seconds of background noise before and
after the utterance. Seven hours of background noise not part of the training set
are also included. Also included are baseline scoring, decoding and retraining tools
based on Cambridge University's tool, HTK (the Hidden Markov Model Toolkit) and related
recipes. These tools include three baseline speaker-independent recognition systems
trained on clean, reverberated and noisy data, respectively, and a number of scripts.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vincent, Emmanuel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Barker, Jon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Watanabe, Shinji
ADDED ENTRY--PERSONAL NAME
- Personal name:
Le Roux, Jonathan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nesta, Francesco
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matassoni, Marco
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638021
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 335-339-972-504-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Abstract Meaning Representation (AMR) Annotation Release 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the
Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's
Computational Language and Educational Research group and the Information Sciences
Institute at the University of Southern California. It contains a sembank (semantic
treebank) of over 39,260 English natural language sentences from broadcast conversations,
newswire, weblogs and web discussion forums. AMR captures “who is doing what to whom”
in a sentence. Each sentence is paired with a graph that represents its whole-sentence
meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles,
within-sentence coreference, named entity annotation, modality, negation, questions,
quantities, and so on to represent the semantic structure of a sentence largely independent
of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation
Release 1.0 (LDC2014T12). *Data* The source data includes discussion forums collected
for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin
Chinese broadcast news programming from China Central TV, Wall Street Journal text,
translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and
weblog data used in the DARPA GALE program. The following table summarizes the number
of training, dev, and test AMRs for each dataset in the release. Totals are also provided
by partition and dataset:
Dataset                 Training  Dev   Test  Totals
BOLT DF MT              1061      133   133   1327
Broadcast conversation  214       0     0     214
Weblog and WSJ          0         100   100   200
BOLT DF English         6455      210   229   6894
DEFT DF English         19558     0     0     19558
Guidelines AMRs         819       0     0     819
2009 Open MT            204       0     0     204
Proxy reports           6603      826   823   8252
Weblog                  866       0     0     866
Xinhua MT               741       99    86    926
Totals                  36521     1368  1371  39260
For those interested in utilizing a standard/community
partition for AMR research (for instance in development of semantic parsers), data
in the "split" directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test
partitions, with most smaller datasets assigned to one of the splits as a whole. Note
that splits observe document boundaries. The "unsplit" directory contains the same
39,260 AMRs with no train/dev/test partition.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Knight, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Badarau, Bianca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Baranescu, Laura
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bonial, Claire
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bardocz, Madalina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Griffitt, Kira
ADDED ENTRY--PERSONAL NAME
- Personal name:
Hermjakob, Ulf
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marcu, Daniel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ADDED ENTRY--PERSONAL NAME
- Personal name:
O'Gorman, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Schneider, Nathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638056
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 217-906-813-531-9
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Metalogue Multi-Issue Bargaining Dialogue
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium
under the European Community's Seventh Framework Programme for Research and Technological
Development. This release consists of approximately 2.5 hours of semantically annotated
English dialogue data that includes speech and transcripts. The goal of the Metalogue
project was to develop a dialogue system with flexible dialogue management to enable
the system's behavior in setting goals, choosing strategies and monitoring various
processes. Participants were involved in a multi-issue bargaining scenario in which
a representative of a city council and a representative of small business owners negotiated
the implementation of new anti-smoking regulations. The negotiation involved four
issues, each with four or five options. Participants received a preference profile
for each scenario and negotiated for an agreement with the highest value based on
their preference information. Negotiators were not allowed to accept an agreement
with a negative value or to share their preference profiles with other participants.
*Data* Six unique subjects (undergraduates between 19 and 25 years of age) participated
in the collection. The dialogue speech was captured with two headset microphones and
saved in 16kHz, 16-bit mono linear PCM FLAC format. Speech signal files are of two
types: full dialogue session; and segmented speech signal, cut per speaker and roughly
per turn. Transcripts were produced semi-automatically, using an automatic speech
recognizer followed by manual correction. Seven types of annotation were performed
manually using the Anvil tool: dialogue act annotations; discourse structure acts;
contact management acts; task management dialogue acts; negotiation moves; rhetorical
relations; and disfluencies in speech production. More information about the annotation
process is included in the documentation. All text is presented in UTF-8 as either
plain text or XML.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Petukhova, Volha
ADDED ENTRY--PERSONAL NAME
- Personal name:
Malchanau, Andrei
ADDED ENTRY--PERSONAL NAME
- Personal name:
Oualil, Youssef
ADDED ENTRY--PERSONAL NAME
- Personal name:
Klakow, Dietrich
ADDED ENTRY--PERSONAL NAME
- Personal name:
Stevens, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
de Weerd, Harmen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taatgen, Niels
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638048
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 012-801-947-534-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
KSUEmotions was developed by King Saud University (KSU) and contains approximately
five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers
were from three countries: Yemen, Saudi Arabia and Syria. Subjects read MSA sentences
from newswire text in the following emotions: neutral, anger, sadness, happiness,
surprise, and interrogative (asking a question). Human reviewers then listened to
the recordings to identify the emotion they heard. *Data* Audio was recorded in each
participant's home. Audio is presented as 16-bit 16 kHz FLAC-compressed WAV. In addition
to speech files and metadata about the speakers, timeless label files and automatic
time segmentation alignment files are included. Text is presented as UTF-8 plain text.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Meftah, Ali Hamid
ADDED ENTRY--PERSONAL NAME
- Personal name:
Alotaibi, Yousef Ajami
ADDED ENTRY--PERSONAL NAME
- Personal name:
Selouani, Sid-Ahmed
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638064
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 307-154-318-802-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT English Discussion Forums
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC)
and consists of 830,440 discussion forum threads in English harvested from the Internet
using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational
Language Translation) program developed machine translation and information retrieval
for less formal genres, focusing particularly on user-generated content. LDC supported
the BOLT program by collecting informal data sources -- discussion forums, text messaging
and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking, propbanking
and co-reference. The material in this release represents the unannotated English
source data in the discussion forum genre. *Data* Collection was seeded based on the
results of manual data scouting by native speaker annotators. Scouts were instructed
to seek content in English that was original, interactive and informal. Upon locating
an appropriate thread, scouts submitted the URL and some simple judgments about it
to a database, via a web browser plug-in. When multiple threads from a forum were
submitted, the entire forum was automatically harvested and added to the collection.
The scale of the collection precluded manual review of all data. Only a small portion
of the threads included in this release were manually reviewed, and it is expected
that there may be some offensive or otherwise undesired content as well as some threads
that contain a large amount of non-English content. Language identification was performed
on all threads in this corpus (using CLD2), and threads for which the results indicate
a high probability of largely non-English content are listed in eng_suspect_LID.txt
in the docs directory of this package. The corpus is comprised of zipped HTML and
XML files. Each HTML file is the raw HTML downloaded from the discussion thread.
If the thread spanned multiple URLs, it was stored as a concatenation of the downloaded
HTML files. The XML files were converted from the raw HTML. *Acknowledgement* This
material is based upon work supported by the Defense Advanced Research Projects Agency
(DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect
the position or the policy of the Government, and no official endorsement should be
inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u tam d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638072
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 744-786-064-362-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tam
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 200 hours of Tamil conversational and scripted telephone speech collected
in 2012 and 2013 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Tamil speech in this release represents
that spoken in the Northern, Central, Southern and Western dialect regions of the
Indian state of Tamil Nadu. The gender distribution among speakers is approximately
equal; speakers' ages range from 16 years to 65 years. Calls were made using different
telephones (e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle. Audio data is presented as
8kHz 8-bit a-law encoded audio in sphere format and 48kHz 24-bit PCM encoded audio
in wav format. Transcripts are encoded in UTF-8. The romanization scheme was developed
by Appen. Further information about transcription methodology is contained in the
documentation accompanying this release. Evaluation data is available from NIST in
support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Tamil. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Corris, Miriam
ADDED ENTRY--PERSONAL NAME
- Personal name:
David, Anne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kaiser-Schatzlein, Alice
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paget, Shelley
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Viswanath, Arun
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ben d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638080
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 809-338-263-232-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
lah
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ben
- Language code of text/sound track or separate title:
hin
- Language code of text/sound track or separate title:
pnb
- Language code of text/sound track or separate title:
tam
- Language code of text/sound track or separate title:
urd
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multi-Language Conversational Telephone Speech 2011 -- South Asian
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multi-Language Conversational Telephone Speech 2011 -- South Asian was developed by
the Linguistic Data Consortium (LDC) and is comprised of approximately 118 hours of
telephone speech in five distinct language varieties of South Asia (i.e. the Indian
sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were collected primarily
to support research and technology evaluation in automatic language identification,
and portions of these telephone calls were used in the NIST 2011 Language Recognition
Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects,
some of which could be considered mutually intelligible or closely related. LDC has
also released the following as part of the Multi-Language Conversational Telephone
Speech 2011 series: * Slavic Group (LDC2016S11) * Turkish (LDC2017S09) * Central Asian
(LDC2018S03) *Data* Participants were recruited by native speakers who contacted acquaintances
in their social network. Those native speakers made one call, up to 15 minutes, to
each acquaintance. The data was collected using LDC's telephone collection infrastructure,
comprised of three computer telephony systems. Human auditors labeled calls for callee
gender, dialect type and noise. Demographic information about the participants was
not collected. All audio data are presented in FLAC-compressed MS-WAV (RIFF) file
format (*.flac); when uncompressed, each file is 2 channels, recorded at 8000 samples/second
with samples stored as 16-bit signed integers, representing a lossless conversion
from the original mu-law sample data as captured digitally from the public telephone
network. The following table summarizes the total number of calls, total number of
hours of recorded audio, and the total size of compressed data:
group    lng     #calls  #hours  #MB
s_asian  ben     118     26.6    1374
s_asian  hin     37      7.4     383
s_asian  pnb     207     38.8    1921
s_asian  tam     101     22.9    1140
s_asian  urd     116     22.9    1140
s_asian  Totals  579     118.3   5913
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Bengali, Hindi, Western Panjabi, Tamil, and Urdu. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638102
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 805-236-402-881-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Broadcast Conversation Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Broadcast Conversation Speech was developed by the Linguistic
Data Consortium (LDC) and is comprised of approximately 75 hours of Arabic broadcast
conversation speech collected in 2008 and 2009 by LDC, MediaNet (Tunis, Tunisia) and
MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast Conversation
Transcripts (LDC2017T12). Broadcast audio for the GALE program was collected at LDC’s
Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University
of Science and Technology (HKUST), Hong Kong (Chinese), Medianet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. LDC’s local broadcast collection system
is highly automated, easily extensible, robust, and capable of collecting, processing
and evaluating hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast
satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular. All signal routing is performed under
computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high
bandwidth A/V format and are then processed to extract audio, to generate keyframes
and compressed audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition (ASR) output.
An overview of the system, the sources recorded and the configuration of the recording
laboratory are contained in the Guidelines for Broadcast Audio Collection Version
3.0 included in this release. LDC designed a portable platform for remote broadcast
collection. This is a TiVo-style digital video recording (DVR) system that records
two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL)
and FTA DVB-S satellite programming and can operate outside of the United States.
It has a small footprint, weighs less than 30 pounds and can be transported as carry-on
luggage. Medianet collected Arabic programming from across the Gulf region using its
internal system and LDC's portable broadcast collection platform installed in 2008.
The portable platform deployed at the Medianet Tunisian collection facility collected
multiple streams of regional Arabic programming from various sources. MTC collected
Arabic programming using its internal collection system. *Data* The broadcast conversation
recordings in this release feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources: Al Alam News Channel,
based in Iran; Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast
station based in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster;
Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national
broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates;
Lebanese Broadcasting Corporation, a Lebanese television station; Saudi TV, a national
television station based in Saudi Arabia; Syria TV, the national television station
in Syria; and Tunisian National TV, a national television station in Tunisia. This
release contains 83 audio files presented in FLAC-compressed Waveform Audio File format
(.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic
speaker following Audit Procedure Specification Version 2.0 which is included in this
release. The broadcast auditing process served three principal goals: as a check on
the operation of the broadcast collection system equipment by identifying failed,
incomplete or faulty recordings; as an indicator of broadcast schedule changes by
identifying instances when the incorrect program was recorded; and as a guide for
data selection by retaining information about a program’s genre, data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638099
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 880-139-952-587-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Arabic Broadcast Conversation Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Arabic Broadcast Conversation Transcripts was developed by the Linguistic
Data Consortium (LDC) and contains transcriptions of approximately 75 hours of Arabic
broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet (Tunis,
Tunisia) and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase
4 Arabic Broadcast Conversation Speech (LDC2017S15). The broadcast conversation recordings
feature interviews, call-in programs and roundtable discussions focusing principally
on current events from the following sources: Al Alam News Channel, based in Iran;
Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast station based
in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera,
a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast
station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese
Broadcasting Corporation, a Lebanese television station; Saudi TV, a national television
station based in Saudi Arabia; Syria TV, the national television station in Syria;
and Tunisian National TV, a national television station in Tunisia. *Data* The transcript
files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 475,211 tokens. The transcripts were created with the LDC tool XTrans,
which supports manual transcription and annotation of audio recordings. XTrans is
available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic and Standard Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638110
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S16
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LDC Spoken Language Sampler - Fourth Release
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LDC (Linguistic Data Consortium) Spoken Language Sampler - Fourth Release, LDC catalog
number LDC2017S16 and ISBN 1-58563-811-0, contains samples from 18 different corpora
published by LDC between 1996 and 2017. LDC distributes a wide and growing assortment
of resources for researchers, engineers and educators whose work is concerned with
human languages. Historically, most linguistic resources were not generally available
to interested researchers but were restricted to single laboratories or to a limited
number of users. Inspired by the success of selected readily available and well-known
data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide
a new mechanism for large-scale corpus development and resource sharing. With the
support of its members, LDC provides critical services to the language research community
that include: maintaining the LDC data archives, producing and distributing data via
media or web download, negotiating intellectual property agreements with potential
information providers and maintaining relations with other like-minded groups around
the world. Resources available from LDC include speech, text, video data and lexicons
in multiple languages, as well as software tools to facilitate the use of corpus materials.
For a complete view of LDC's publications, browse the Catalog. The sampler is available
as a free download. *Data* The LDC Spoken Language Sampler - Fourth Release provides
speech and transcript samples and is designed to illustrate the variety and breadth
of the speech-related resources available from the LDC Catalog. The sound files included
in this release are excerpts that have been modified in various ways relative to the
original data as published by LDC: * Most excerpts are truncated to be much shorter
than the original files, typically between 1.5 and 2 minutes. * Signal amplitude has
been adjusted where necessary to normalize playback volume. * Some corpora are published
in compressed form, but all samples here are uncompressed. * Some text files are presented
as images to ensure foreign character sets display properly. * In some publications,
NIST SPHERE file format is used for audio data, but the audio files in this sampler
are MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
FLAC files have been expanded into their wav form as well. The link for the catalog
number takes you to the catalog entry, and the link for the title takes you to further
documentation for that corpus. LDC2017S06 2010 NIST Speaker Recognition Evaluation
Test Set 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 2,255 hours of
American English telephone speech and interview speech recorded over a microphone
channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation
(SRE). LDC2015S10 Arabic Learner Corpus Arabic Learner Corpus was developed at the
University of Leeds and consists of written essays and spoken recordings by Arabic
learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words
in 1,585 materials, produced by 942 students from 67 nationalities studying at pre-university
and university levels. The average length of an essay is 178 words. LDC2015S12 Articulation
Index LSCP Articulation Index LSCP was developed by researchers at Laboratoire de
Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises
and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons
speaking English syllables. Changes include the addition of forced alignment to sound
files, time alignment of syllable utterances and format conversions. LDC2014S01 CALLFRIEND
Farsi Second Edition Speech CALLFRIEND Farsi Second Edition Speech was developed by
LDC and consists of approximately 42 hours of telephone conversation (100 recordings)
among native Farsi speakers. The CALLFRIEND project supported the development of language
identification technology. Each CALLFRIEND corpus consists of unscripted telephone
conversations lasting between 5 and 30 minutes. LDC2016S04 CHM150 CHM150 (Corpus Hecho
en México 150) was developed by the Speech Processing Laboratory of the Faculty of
Engineering at the National Autonomous University of Mexico (UNAM) and consists of
approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker
metadata. The goal of this work was to support spoken term detection and forensic
speaker identification. LDC2007S18 CSLU: Kids' Speech Version 1.1 CSLU: Kids' Speech
Version 1.1 is a collection of spontaneous and prompted speech from 1100 children
between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. Approximately
100 children at each grade level read around 60 items from a total list of 319 phonetically-balanced
but simple words, sentences or digit strings. Each utterance of spontaneous speech
begins with a recitation of the alphabet and contains a monologue of about one minute
in length. This release consists of 1017 files containing approximately 8-10 minutes
of speech per speaker. Corresponding word-level transcriptions are also included.
LDC2016S12 IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a IARPA Babel Georgian
Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence
Advanced Research Projects Activity) Babel program. It contains approximately 190
hours of Georgian conversational and scripted telephone speech collected in 2014-2015
along with corresponding transcripts. LDC2003S07 Korean Telephone Conversations Complete
(S), (T), (L) The Korean telephone conversations were originally recorded as part
of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone
conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the
remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts
consists of 100 text files, totaling approximately 190K words and 25K unique words.
All files are in Korean orthography: orthographic Korean characters are in Hangul,
encoded in KSC5601 (Wansung) system. The complete set of Korean Telephone Conversations
also includes a transcript (LDC2003T08) and lexicon (LDC2003L02) corpus. LDC2012S04
Malto Speech and Transcripts Malto Speech and Transcripts contains approximately 8
hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males,
5 females), accompanying transcripts, English translations and glosses for 6 hours
of the collection. Speakers were asked to talk about themselves, their lives, rituals
and folklore; elicitation interviews were then conducted. The goal of the work was
to present the current state and dialectal variation of Malto. LDC2015S04 Mandarin-English
Code-Switching in South-East Asia Mandarin-English Code-Switching in South-East Asia
was developed by Nanyang Technological University and Universiti Sains Malaysia and
includes approximately 192 hours of Mandarin-English code-switching speech from 156
speakers with associated transcripts. LDC2017S11 Metalogue Multi-Issue Bargaining
Dialogue Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue
Consortium under the European Community's Seventh Framework Programme for Research
and Technological Development. This release consists of approximately 2.5 hours of
semantically annotated English dialogue data that includes speech and transcripts.
LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic Group Multi-Language
Conversational Telephone Speech 2011 -- Slavic Group was developed by LDC and is comprised
of approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The
data was collected to support research and technology evaluation in automatic language
identification, specifically language pair discrimination for closely related languages/dialects.
Portions of these telephone calls were used in the NIST 2011 Language Recognition
Evaluation. LDC2017S09 Multi-Language Conversational Telephone Speech 2011 -- Turkish Multi-Language
Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised
of approximately 18 hours of telephone speech in Turkish. The data was collected primarily
to support research and technology evaluation in automatic language identification,
specifically language pair discrimination for closely related languages/dialects.
LDC2004S09 NIST Meeting Pilot Corpus Speech The audio data included in this corpus
was collected in the NIST Meeting Data Collection Laboratory for the NIST Automatic
Meeting Recognition Project. The corresponding transcripts are available as the NIST
Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13), while the video files
will be published later as NIST Meeting Pilot Corpus Video. For more information regarding
the data collection conditions, meeting scenarios, transcripts, speaker information,
recording logs, errata, and other ancillary data for the corpus, please consult the
NIST project website for this corpus. LDC2017S04 Noisy TIMIT Speech Noisy TIMIT Speech
was developed by the Florida Institute of Technology and contains approximately 322
hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1)
modified with different additive noise levels. Only the audio has been modified; the
original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.
LDC2015S08 The Walking Around Corpus The Walking Around Corpus was developed by Stony
Brook University and is comprised of approximately 33 hours of navigational telephone
dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University
students who identified themselves as native English speakers. LDC2012S02 TORGO Database
of Dysarthric Articulation TORGO contains approximately 23 hours of English speech
data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females)
with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males,
3 females) from a non-dysarthric control group. LDC2014S04 USC-SFI MALACH Interviews
and Transcripts Czech USC-SFI MALACH Interviews and Transcripts Czech was developed
by The University of Southern California Shoah Foundation Institute (USC-SFI) and
the University of West Bohemia as part of the MALACH (Multilingual Access to Large
Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420
interviewees along with transcripts and other documentation.
LANGUAGE NOTE
- Language note:
Content in multiple languages. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638145
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 044-780-649-668-4
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Vehicle City Voices Corpus – Part I
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint,
and is an ongoing oral history project and survey of English language variation in
Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts
from interviews of Flint residents conducted between 2012 and 2015. The corpus was
designed to provide high-quality recordings for acoustic analysis and to examine narrative
structure and discursive construction of individual and collective identity in urban
spaces. *Data* This release comprises 21 interviews conducted by undergraduate and graduate
students for civic engagement projects in linguistics courses and by a graduate student
research assistant. Participants (11 female, 10 male) were born between 1935 and 1991
and represented a range of ages, genders, and ethnicities. Of the interviewees, 11
were Black/African American, 8 were White/Caucasian, and 2 were biracial/mixed ethnic
heritage. Interviews took place in various locations in Flint, including university
and community spaces and a church meeting room. Questions focused on recollections
of important community events, remembrances about the community, the interviewee's
relationship to the auto industry and the city's physical transformation, among other
topics. Sessions were recorded using Marantz PMD661 portable SD recorders with accompanying
Audio-Technica AT831B lavalier condenser microphones. The original recordings were
uncompressed (PCM-16) sound files stored in WAV format recorded at a sampling rate
of 44,100 Hz. These files were then converted to FLAC format. Transcripts are plain
text and UTF-8. Metadata (where provided by participants) includes information on
gender, ethnicity, year of birth, level of education, field of employment, average
income, length of time living in Flint and its surrounding areas, as well as interviewer
age, gender, and ethnicity. In addition, original interview durations, edited interview
durations, interview year, and transcript word counts are also provided in the metadata
file.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Britt, Erica
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638129
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 228-559-981-287-1
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2015-2016 CoNLL Shared Task
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2015-2016 CoNLL Shared Task, LDC Catalog Number LDC2017T13 and ISBN 1-58563-812-9,
contains the Chinese and English training, development and test data for the 2015
and 2016 CoNLL (Conference on Computational Natural Language Learning) Shared Task
Evaluation which focused on shallow discourse parsing. The Conference on Computational
Natural Language Learning (CoNLL) is accompanied every year by a shared task intended
to promote natural language processing applications and evaluate them in a standard
setting. Shallow discourse parsing is the task of parsing a piece of text into a set
of discourse relations between two adjacent or non-adjacent discourse units. This
task is called shallow discourse parsing because the relations in a text are not connected
to one another to form a connected structure in the form of a tree or graph. LDC has
also released the following CoNLL Shared Task data sets: * 2006 CoNLL Shared Task
- Ten Languages (LDC2015T11) * 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
* 2008 CoNLL Shared Task Data (LDC2009T12) * 2009 CoNLL Shared Task Part 1 (LDC2012T03)
* 2009 CoNLL Shared Task Part 2 (LDC2012T04) *Data* This release consists of the tokenized,
tagged, and parsed data in English and Chinese. The English train, dev and test data
are from Wall Street Journal material in Penn Discourse Treebank Version 2.0 (LDC2008T05);
English blind test data are from Wikinews. Chinese train, dev and test data are news
material from Chinese Discourse Treebank 0.5 (LDC2014T21); Chinese blind test data
are from Wikinews.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Chinese, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xue, Nianwen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ng, Hwee Tou
ADDED ENTRY--PERSONAL NAME
- Personal name:
Pradhan, Sameer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rutherford, Attapol T.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Webber, Bonnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wang, Chuan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wang, Hong Min
ADDED ENTRY--PERSONAL NAME
- Personal name:
Prasad, Rashmi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T13
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638137
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 856-028-165-105-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SRI-FRTIV
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed by SRI
International in 2007-2008 and is comprised of approximately 232 hours of English
speech from thirty-four speakers who were members of Toastmasters clubs. Participants
were asked to speak at three different levels of effort (low, normal and high) in
four different styles (interview, conversation, reading and oration) to study the
question of how intrinsic variations -- associated with the speaker rather than the
recording environment -- affect text-independent speaker verification. *Data* Participants
were native speakers of North American English who were members of local Toastmasters
clubs and had experience in public speaking. This release includes demographic information
for 30 speakers (15 male, 15 female), including gender, birth year, height, education
level, years in Toastmasters, and a self-evaluation of speaking skills. Not all effort
levels were applicable for each speaking style and so were not collected. Interviews
and phone conversations were not recorded at high effort and oration was not recorded
at low or normal effort levels. Speech data is presented as 16 kHz, 16-bit, single-channel
PCM audio compressed in FLAC format (.flac).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shriberg, Elizabeth
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kathol, Andreas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graciarena, Martin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bratt, Harry
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kajarekar, Sachin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jameel, Huda
ADDED ENTRY--PERSONAL NAME
- Personal name:
Richey, Colleen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Goodman, Fred
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u zul d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638153
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S19
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 562-487-689-567-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
zul
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
zul
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S19
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 211 hours of Zulu conversational and scripted telephone speech collected
in 2012 and 2013 along with corresponding transcripts. The Babel program focuses on
underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Zulu speech in this release represents
that spoken in the KZN (KwaZulu-Natal)-urban dialect region of South Africa. The gender
distribution among speakers is approximately equal; speakers' ages range from 16 years
to 70 years. Calls were made using different telephones (e.g., mobile, landline) from
a variety of environments including the street, a home or office, a public place,
and inside a vehicle. Audio data is presented as 8kHz 8-bit a-law encoded audio in
sphere format and 48kHz 24-bit PCM encoded audio in wav format. Transcripts are encoded
in UTF-8. Further information about transcription methodology is contained in the
documentation accompanying this release. Evaluation data is available from NIST in
support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Zulu. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Adams, Nikki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Willa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wong, Jamie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S19
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u ger d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638420
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T05
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 553-412-087-213-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
ger
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
deu
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
H2, E2, ERK1 Children's Writing
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
still image
- Content type code:
sti
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T05
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
H2, E2, ERK1 Children's Writing was developed by the Cooperative State University
Baden-Württemberg, University of Education. It consists of approximately 2,000 texts
written over four months by 173 German school children aged six through eleven.
The data in this corpus was collected by elementary schools in Baden-Württemberg,
Germany and digitized at the Cooperative State University during the 2016/2017 school
year. Three second, third, and fourth grade classrooms participated in the collection.
Texts were written within regular class settings. The students were presented with
a picture and were asked to write a story, to describe the picture or, if unable to
write a text, to list what they saw in the picture. The pictures were designed to
enhance the output with respect to important spelling error categories, namely, the
marking of short vowels with a silent consonant letter and the correct spelling of
the long vowel <ie>. The children were allowed at least 15 minutes to write the texts.
This exercise was repeated weekly for nine or sixteen weeks depending on the program.
LDC has also released H1 Children's Writing (LDC2016T01). *Data* There were 173 total
participants. 100 students were multilingual, and further metadata is available for
166 of the 173 children. The following is included for each text in the database:
school week of collection; school type; age; gender; grade/classroom; language spoken
at home; and school materials used. In all, 2,117 texts representing 118,621 tokens
were collected. The texts were digitized in two forms: (1) the original text, including
all errors (achieved), and (2) the intended (target) text, where all spelling errors
were removed. Annotations were added to both the achieved text and the target text
to distinguish words that should not be analyzed for spelling errors, such as names
or foreign words. For sentence-level analysis, syntax errors were annotated by marking
substitutions, deletions and insertions at the word level. In such cases, the used
word was analyzed for spelling, and the correct word was used for sentence structure
analysis. Original handwriting is presented as pdf documents and the converted text
as UTF-8 plain text in csv documents.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in German. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Pictures
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Berkling, Kay
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T05
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638161
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T14
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 924-985-704-453-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
lzh
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Ancient Chinese Corpus
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T14
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented
and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to
date from the Warring States Period (475-221 BC). Zuozhuan is a commentary on the
Chunqiu, a history of the Chinese Spring and Autumn period (770-476 BC). This release
is part of a continuing project to develop a large, part-of-speech tagged ancient
Chinese corpus. *Data* Ancient Chinese Corpus consists of 180,000 Chinese characters
and 195,000 segment units (including words and punctuation). The part-of-speech tag
set was developed by Nanjing Normal University and contains 17 tags. This release
contains two text files: 268 paragraphs and 10,560 lines. A line is one sentence;
paragraphs are separated by one empty line. Each word is tagged with its part-of-speech
and separated by a space. The files are presented in UTF-8 plain text files using
traditional Chinese script.
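The layout described above (one sentence per line, one empty line between paragraphs, space-separated tagged words) could be read roughly as follows. This is a minimal sketch, not tooling shipped with the corpus: the record does not say how a tag is attached to its word, so the "/" delimiter, the function name parse_tagged_text and the sample tags are all assumptions made for illustration.

```python
def parse_tagged_text(text, delimiter="/"):
    """Split text into paragraphs -> sentences -> (word, tag) pairs.

    Assumes one sentence per line, a blank line between paragraphs,
    and space-separated tokens. The word/tag delimiter is hypothetical.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
            continue
        sentence = []
        for token in line.split():
            # rpartition splits on the last delimiter, so a delimiter
            # occurring inside the word itself is left untouched.
            word, sep, tag = token.rpartition(delimiter)
            sentence.append((word, tag) if sep else (token, None))
        current.append(sentence)
    if current:
        paragraphs.append(current)
    return paragraphs


# Invented sample; the tags are placeholders, not the corpus tag set.
sample = "元年/t 春/n\n\n三月/t\n"
parsed = parse_tagged_text(sample)
```

Tokens without the assumed delimiter are kept with a tag of None rather than dropped, so a wrong guess about the delimiter shows up in the output.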
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Literary Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Chen, Xiaohe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Feng, Minxuan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Xu, Runhua
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wang, Qingqing
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T14
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u sem d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S20
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 834-222-629-362-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
sem
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
per
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ajp
- Language code of text/sound track or separate title:
apc
- Language code of text/sound track or separate title:
fas
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
RATS Keyword Spotting
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S20
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
RATS Keyword Spotting was developed by the Linguistic Data Consortium (LDC) and is
comprised of approximately 3,100 hours of Levantine Arabic and Farsi conversational
telephone speech with automatic and manual annotation of speech segments, transcripts
and keywords generated from transcript content. The corpus was created to provide
training, development and initial test sets for the keyword spotting (KWS) task in
the DARPA RATS (Robust Automatic Transcription of Speech) program. The goal of the
RATS program was to develop human language technology systems capable of performing
speech detection, language identification, speaker identification and keyword spotting
on the severely degraded audio signals that are typical of various radio communication
channels, especially those employing various types of handheld portable transceiver
systems. To support that goal, LDC assembled a system for the transmission, reception
and digital capture of audio data that allowed a single source audio signal to be
distributed and recorded over eight distinct transceiver configurations simultaneously.
Those configurations included three frequencies -- high, very high and ultra high
-- variously combined with amplitude modulation, frequency hopping spread spectrum,
narrow-band frequency modulation, single-side-band or wide-band frequency modulation.
Annotations on the clear source audio signal, e.g., time boundaries for the duration
of speech activity, were projected onto the corresponding eight channels recorded
from the radio receivers. *Data* The source audio consists of conversational telephone
speech recordings collected by LDC: (1) data collected for the RATS program from Levantine
Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data
Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01).
Annotation was performed in two steps. Transcripts of calls were either produced or
already available from the source corpora. For the CALLFRIEND Farsi calls, transcripts
were updated by native Farsi speakers. Potential target keywords were selected from
the transcripts on the basis of overall word frequencies to fall within a given range
of target-word likelihood per hour of speech. The selected words were then reviewed
by native speakers to confirm that each selection was a regular word or multi-word
expression of more than three syllables. All audio files are presented as single-channel,
16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files;
when uncompressed, the files have typical "MS-WAV" (RIFF) file headers. The data is
divided for use as training, initial development set, and initial evaluation set (note
that the initial evaluation only used Levantine Arabic data).
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in South Levantine Arabic, North Levantine Arabic, and Persian. Documentation
in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S20
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638188
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T15
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 385-163-116-259-0
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
English Web Treebank Propbank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T15
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
English Web Treebank Propbank, LDC Catalog Number LDC2017T15 and ISBN 1-58563-818-8,
was developed by the University of Colorado Boulder - CLEAR (Computational Language
and Education Research) and provides predicate-argument structure annotation for English
Web Treebank (LDC2012T13). The goal of Propbank (or proposition bank) annotation is
to develop annotations with information about basic semantic propositions. English
Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation
for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational
clauses and all nouns considered to be predicative. Mark-up is in the "unified" propbank
annotation format, which combines representations in nouns, verbs and adjectives.
*Data* The source data consists of weblogs, newsgroups, email, reviews and questions-answers.
Human annotators followed the guidelines included with this release. Annotated propositions
were automatically validated to ensure that (1) pointers to the tree nodes were valid,
(2) Propbank labels were valid, and (3) Propbank annotation was consistent with the
associated frameset. Additionally, XML frame files were validated against the included
DTD and were checked for frame-internal consistency (e.g., misspellings, extraneous
characters, general correctness). Data is presented in UTF-8 XML files.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
O'Gorman, Tim
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conger, Katherine
ADDED ENTRY--PERSONAL NAME
- Personal name:
Palmer, Martha
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T15
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638196
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T16
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 386-404-178-211-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
MWE-Aware English Dependency Corpus 2.0
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T16
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute
of Science and Technology Computational Linguistics Laboratory and consists of English
compound function words annotated in dependency format. The data is derived from OntoNotes
Release 5.0 (LDC2013T19). Compound function words are a type of multiword expression
(MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic
unit. Doing so facilitates natural language processing tasks such as constituency
and dependency parsing. Version 2.0 adds annotations of named entities (persons, locations,
organizations) into dependency trees that are aware of compound function words. Version
1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01). *Data*
MWE-Aware English Dependency Corpus Version 2.0 was derived from the Wall Street Journal
portion of OntoNotes Release 5.0. MWEs were identified in OntoNotes' phrase structure
trees and each MWE was established as a single subtree. Those phrase structure subtrees
were then converted to a dependency structure (the Stanford dependencies) in CoNLL
format. The data is split into 1,728 phrase structure trees as *.parse files and a
single 14-column, tab-separated dependency file in *.conll format. Both file types
are encoded as UTF-8.
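A 14-column, tab-separated dependency file of the kind described above could be loaded with a short reader like the one below. This is a sketch under stated assumptions: the record gives only the column count, so the columns are kept as raw strings without names, and the convention that a blank line separates sentences is assumed from common CoNLL practice rather than confirmed by the record. The function name read_conll is hypothetical.

```python
import io

EXPECTED_COLUMNS = 14  # column count stated in the catalog record


def read_conll(handle, n_columns=EXPECTED_COLUMNS):
    """Group tab-separated rows into sentences (lists of field lists).

    Assumes blank lines separate sentences; raises ValueError on any
    row whose column count differs from n_columns.
    """
    sentences, current = [], []
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        current.append(fields)
    if current:
        sentences.append(current)
    return sentences


# Invented rows with a placeholder value in every column.
row = "\t".join(["_"] * EXPECTED_COLUMNS)
demo = read_conll(io.StringIO(row + "\n" + row + "\n\n" + row + "\n"))
```

For a real file, the handle would be opened with encoding="utf-8" to match the encoding stated in the record.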
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kato, Akihiko
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shindo, Hiroyuki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Matsumoto, Yuji
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T16
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S21
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 025-809-032-700-5
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
ASpIRE Development and Development Test Sets
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S21
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
ASpIRE Development and Development Test Sets was developed for the Automatic Speech
recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the
Intelligence Advanced Research Projects Activity). It contains approximately 226 hours
of English speech with transcripts and scoring files. The ASpIRE challenge asked solvers
to develop innovative speech recognition systems that could be trained on conversational
telephone speech, and yet work well on far-field microphone data from noisy, reverberant
rooms. Participants had the opportunity to evaluate their techniques on a common set
of challenging data that included significant room noise and reverberation. *Data*
The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews,
transcript readings and conversational telephone speech collected by the Linguistic
Data Consortium in 2009 and 2010 from native English speakers local to the Philadelphia
area. The transcripts were developed by Appen for the ASpIRE challenge. Data is divided
into development and development test sets. Audio is presented as single channel,
16kHz 16-bit Signed Integer PCM *.wav files. Transcripts are plain text tdf files.
Scoring files are also included.
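The audio format stated above (single channel, 16kHz, 16-bit PCM in *.wav files) can be checked against a file header with Python's standard wave module. The sketch below is illustrative only: the function name matches_described_format is hypothetical, and the sample it inspects is one second of silence generated in memory, not corpus data.

```python
import io
import wave


def matches_described_format(source, channels=1, rate=16000, sample_bytes=2):
    """Return True if a WAV source has the given channels, rate and width."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getnchannels() == channels
            and wav.getframerate() == rate
            and wav.getsampwidth() == sample_bytes  # 2 bytes = 16 bits
        )


# Build one second of silence in memory so the check can be tried
# without access to the corpus.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)
ok = matches_described_format(buffer)
```

The source argument can also be a path to a *.wav file on disk.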
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ADDED ENTRY--PERSONAL NAME
- Personal name:
Appen Pty Ltd
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S21
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u kur d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638218
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S22
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 063-189-164-925-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
kur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
kmr
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S22
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted
telephone speech collected in 2013 and 2014 along with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech recognition
technology that can be rapidly applied to any human language to support keyword search
performance over large amounts of recorded speech. *Data* The Kurmanji Kurdish speech
in this release represents that spoken in the southeastern and eastern Anatolian regions
of Turkey. The gender distribution among speakers is approximately 37% female and
63% male; speakers' ages range from 16 years to 70 years. Calls were made using different
telephones (e.g., mobile, landline) from a variety of environments including the street,
a home or office, a public place, and inside a vehicle. Audio data is presented as
8kHz 8-bit A-law encoded audio in SPHERE format and 48kHz 24-bit PCM encoded audio
in wav format. Transcripts are encoded in UTF-8. Further information about transcription
methodology is contained in the documentation accompanying this release. Evaluation
data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Northern Kurdish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Heighway, Melanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lin, Willa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Paget, Shelley
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Roomi, Bergul
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ADDED ENTRY--PERSONAL NAME
- Personal name:
Zwanenburg, Jacqui
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S22
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638234
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T17
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 464-261-620-634-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eng
- Language code of text/sound track or separate title:
zho
- Language code of text/sound track or separate title:
cmn
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation
Data 2011-2014
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T17
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation
Data 2011-2014 was developed by the Linguistic Data Consortium and contains training
and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity
Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard
entity type information, Knowledge Base links, and equivalence class clusters for
NIL entities along with the source documents for the queries, specifically, English
and Chinese newswire, discussion forum and web data. The corresponding knowledge base
is available as TAC KBP Reference Knowledge Base (LDC2014T16). Text Analysis Conference
(TAC) is a series of workshops organized by the National Institute of Standards and
Technology (NIST). TAC was developed to encourage research in natural language processing
and related applications by providing a large test collection, common evaluation procedures,
and a forum for researchers to share their results. Through its various evaluations,
the Knowledge Base Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those appearing in a knowledge
base and extract novel information about entities from a document collection and add
it to a new or existing knowledge base. Chinese Cross-lingual Entity Linking was first
conducted as part of the 2011 TAC KBP evaluations. The track was an extension of the
monolingual English Entity Linking track (EL) whose goal is to measure systems' ability
to determine whether an entity, specified by a query, has a matching node in a reference
knowledge base (KB) and, if so, to create a link between the two. If there is no matching
node for a query entity in the KB, EL systems are required to cluster the mention
together with others referencing the same entity. More information about the TAC KBP
Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.
*Data* All source documents were originally released as XML but have been converted
to text files for this release. This change was made primarily because the documents
were used as text files during data development but also because some fail XML parsing.
*Acknowledgement* This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or implied, of
Air Force Research Laboratory and Defense Advanced Research Projects Agency or the
U.S. Government.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in English, Chinese, and Mandarin Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Getman, Jeremy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T17
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638226
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S23
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 273-364-546-427-6
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CIEMPIESS Light
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S23
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería
Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory
of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM)
and consists of approximately 18 hours of Mexican Spanish radio and television speech
and associated transcripts. The goal of this work was to create acoustic models for
automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM
Project website. CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC
as LDC2015S07. This "light" version contains speech and transcripts presented in a
revised directory structure that allows for use with the Kaldi toolkit. *Data* The
speech recordings were collected from Podcast UNAM, a program created by Radio-IUS,
and Mirador Universitario, a TV program broadcast by UNAM. They consist of spontaneous
conversations in Mexican Spanish between a moderator and guests. Approximately 75%
of the speakers were male, and 25% of the speakers were female. The audio was recorded
in MP3 stereo format, using a 44.1 kHz sample rate and bit-rate of 128 kbps or higher.
Only "clean" utterances were selected from the raw data, meaning that the utterances
were made by only one person with no background noises, whispers, music, foreign accents,
white noise or static. The audio files were converted to 16 kHz, 16-bit PCM FLAC format
for this release. Transcripts are presented as UTF-8 encoded plain text. *Acknowledgements*
The authors would like to thank Alejandro V. Mena, Elena Vera, Angélica Gutiérrez
and Beatriz Ancira for their support with the social service program: "Desarrollo
de Tecnologías del Habla." They would also like to thank the social service students
for their hard work.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Mena, Carlos Daniel Hernández
ADDED ENTRY--PERSONAL NAME
- Personal name:
Herrera, Abel
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S23
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638269
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S24
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 857-070-463-285-8
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
CHiME3
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S24
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge
and contains approximately 342 hours of English speech and transcripts from noisy
environments and 50 hours of noisy environment audio. The CHiME Challenges focus on
distant-microphone automatic speech recognition (ASR) in real-world environments.
See the CHiME3 home page for more information. The task in CHiME3 was similar to the
medium vocabulary track of the CHiME2 Challenge in that the target utterances were
taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of
read speech from Wall Street Journal news text. CHiME3 involved two types of data:
speech data recorded in very noisy environments (on a bus, in a cafe, pedestrian area,
and street junction) and noisy utterances generated by artificially mixing clean speech
data with noisy backgrounds. LDC has also released two CHiME2 corpora -- CHiME2 Grid
(LDC2017S07) and CHiME2 WSJ0 (LDC2017S10). *Data* Data is divided into training, development
and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The audio
data consists of background noises, speech data enhanced with the baseline speech
enhancement technique, unsegmented noisy speech data, and segmented noisy speech data.
Annotation files are based on JSON (JavaScript Object Notation) format. Transcripts
are plain text in either DOT or TRN format. Also included are three software tools
for acoustic simulation, speech enhancement, and ASR.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Barker, Jon
ADDED ENTRY--PERSONAL NAME
- Personal name:
Marxer, Ricard
ADDED ENTRY--PERSONAL NAME
- Personal name:
Vincent, Emmanuel
ADDED ENTRY--PERSONAL NAME
- Personal name:
Watanabe, Shinji
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S24
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638242
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017T18
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 752-222-758-626-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast News Transcripts
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017T18
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast News Transcripts was developed by the Linguistic Data
Consortium (LDC) and contains transcriptions of approximately 134 hours of Chinese
broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and
Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. Corresponding audio data is released as GALE Phase
4 Chinese Broadcast News Speech (LDC2017S25). The broadcast news recordings feature
news broadcasts focusing principally on current events from the following sources:
China Central TV (CCTV), a national and international broadcaster in Mainland China;
Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA),
a U.S. government-funded broadcast programmer. *Data* The transcript files are in
plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data
totals 1,696,879 tokens. The transcripts were created with the LDC-developed transcription
tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings. XTrans is available
from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
The files in this corpus were transcribed by LDC staff and/or by transcription vendors
under contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are included
in the documentation with this release. QTR transcription consists of quick (near-)
verbatim, time-aligned transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR annotation adds structural
information such as topic boundaries and manual sentence unit annotation to the core
components of a quick transcript. Files with QTR as part of the filename were developed
using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Glenn, Meghan
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017T18
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2017 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638269
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2017S25
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 903-409-163-576-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
chi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
zho
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
GALE Phase 4 Chinese Broadcast News Speech
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2017]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2017S25
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
GALE Phase 4 Chinese Broadcast News Speech was developed by the Linguistic Data Consortium
(LDC) and consists of approximately 134 hours of Mandarin Chinese broadcast news
speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST),
Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast
News Transcripts (LDC2017T18). Broadcast audio for the GALE program was collected
at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites: HKUST
(Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic).
The combined local and outsourced broadcast collection supported GALE at a rate of
approximately 300 hours per week of programming from more than 50 broadcast sources
for a total of over 30,000 hours of collected broadcast audio over the life of the
program. LDC’s local broadcast collection system is highly automated, easily extensible,
robust, and capable of collecting, processing and evaluating hundreds of hours
of content from several dozen sources per day. The broadcast material is served to
the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite
systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable
television (CATV) feeds. The mapping between receivers and recorders is dynamic and
modular. All signal routing is performed under computer control, using a 256x64 A/V
matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized
closed captions (in the case of North American English) and to generate automatic
speech recognition (ASR) output. An overview of the system, the sources recorded and
the configuration of the recording laboratory are contained in the Guidelines for
Broadcast Audio Collection Version 3.0 included in this release. LDC designed a portable
platform for remote broadcast collection. This is a TiVo-style digital video recording
(DVR) system that records two streams of A/V material simultaneously. It supports
analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside
of the United States. It has a small footprint, weighs less than 30 pounds and can
be transported as carry-on luggage. HKUST collected Chinese broadcast programming
using its internal recording system and a portable broadcast collection platform designed
by LDC and installed at HKUST in 2006. *Data* The broadcast news recordings in this
release feature news broadcasts focusing principally on current events from the following
sources: China Central TV (CCTV), a national and international broadcaster in Mainland
China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America
(VOA), a U.S. government-funded broadcast programmer. This release contains 256 audio
files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure
Specification Version 2.0 which is included in this release. The broadcast auditing
process served three principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or faulty recordings;
as an indicator of broadcast schedule changes by identifying instances when the incorrect
program was recorded; and as a guide for data selection by retaining information about
a program’s genre, data type and topic.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and Chinese. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Caruso, Christopher
ADDED ENTRY--PERSONAL NAME
- Personal name:
Maeda, Kazuaki
ADDED ENTRY--PERSONAL NAME
- Personal name:
DiPersio, Denise
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2017S25
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u baq d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638277
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T06
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 769-620-932-723-2
LANGUAGE CODE
- Language code of text/sound track or separate title:
baq
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
cze
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
eus
- Language code of text/sound track or separate title:
cat
- Language code of text/sound track or separate title:
ces
- Language code of text/sound track or separate title:
tur
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T06
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish consists of dependency treebanks
in four languages used as part of the CoNLL 2007 shared task on multilingual dependency
parsing and domain adaptation. The languages covered in this release are: Basque,
Catalan, Czech and Turkish. LDC also released the following 2006 & 2007 CoNLL Shared
Task corpora: * 2007 CoNLL Shared Task - Greek, Hungarian & Italian (LDC2018T07) *
2007 CoNLL Shared Task - Arabic & English (LDC2018T08) * 2006 CoNLL Shared Task -
Ten Languages (LDC2015T11) * 2006 CoNLL Shared Task - Arabic
& Czech (LDC2015T12) This corpus is cross listed and jointly released with ELRA as
ELRA-W0121. The Conference on Computational Natural Language Learning (CoNLL) is accompanied
every year by a shared task intended to promote natural language processing applications
and evaluate them in a standard setting. In 2006 and 2007, the shared tasks were devoted
to the parsing of syntactic dependencies using corpora from up to thirteen languages.
The task aimed to define and extend the then-current state of the art in dependency
parsing, a technology that complemented previous tasks by producing a different kind
of syntactic description of input text. The 2007 shared task added a domain adaptation
track for English in addition to the multilingual track. More information about the
2007 shared task is available at the CoNLL Previous Tasks web site. LDC has released
data sets from other CoNLL shared tasks. 2008 CoNLL Shared Task Data (LDC2009T12)
contains the English material used in the 2008 shared task which focused on English,
employed a unified dependency-based formalism and merged the tasks of syntactic dependency
parsing, identifying semantic arguments and labeling them with semantic roles. 2009
CoNLL Shared Task Data Parts 1 and 2 (LDC2012T03 and LDC2012T04) consists of the English,
Catalan, Chinese, Czech, German and Spanish resources used in the 2009 task which
included a comparison of time and space complexity based on participants' input and
learning curve comparison for languages with large datasets. 2015-2016 CoNLL Shared
Task (LDC2017T13) contains Chinese and English resources used in the 2015 and 2016
shared tasks on shallow discourse parsing. *Data* The source data in the treebanks in this
release consists principally of various texts (e.g., textbooks, news, literature)
annotated in dependency format. In general, dependency grammar is based on the idea
that the verb is the center of the clause structure and that other units in the sentence
are connected to the verb as directed links or dependencies. This is a one-to-one
correspondence: for every element in the sentence there is one node in the sentence
structure that corresponds to that element. In constituency or phrase structure grammars,
on the other hand, clauses are divided into noun phrases and verb phrases and in each
sentence, one or more nodes may correspond to one element. The Penn Treebank (LDC99T42)
is an example of a constituency or phrase structure approach. All of the data sets
in this release are dependency treebanks. The individual data sets are: * The 3LB
Treebank (Basque) * CESS-Cat Dependency Treebank (Catalan) * Prague Dependency Treebank
2.0 (Czech) * METU-Sabanci Turkish Treebank (Turkish)
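The one-to-one correspondence between sentence elements and nodes described above can be illustrated with a minimal sketch. It assumes the tab-separated, ten-column layout of the CoNLL-X data format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the files in this release may differ in detail, and the sample sentence, tags and labels below are invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One node of a dependency tree: exactly one per word in the sentence."""
    id: int       # 1-based position of the word in the sentence
    form: str     # surface form of the word
    head: int     # id of the governing word; 0 marks the root
    deprel: str   # label of the dependency link to the head


def parse_sentence(block: str) -> list[Token]:
    """Parse one sentence in the assumed ten-column, tab-separated layout."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(
            Token(id=int(cols[0]), form=cols[1], head=int(cols[6]), deprel=cols[7])
        )
    return tokens


def dependents_of(tokens: list[Token], head_id: int) -> list[str]:
    """Return the forms of all words directly attached to the given head."""
    return [t.form for t in tokens if t.head == head_id]


# Invented sample sentence (not from the corpus); the verb is the root.
SAMPLE = "\n".join([
    "1\tShe\t_\tPRON\tPRON\t_\t2\tSBJ\t_\t_",
    "2\treads\t_\tVERB\tVERB\t_\t0\tROOT\t_\t_",
    "3\tbooks\t_\tNOUN\tNOUN\t_\t2\tOBJ\t_\t_",
])

tokens = parse_sentence(SAMPLE)
roots = [t.form for t in tokens if t.head == 0]
verb_dependents = dependents_of(tokens, 2)
```

Each input line yields exactly one Token, so the tree has as many nodes as the sentence has words, and every non-root word points to its head through a single directed link.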
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Basque, Catalan, Czech, and Turkish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
University of the Basque Country
ADDED ENTRY--PERSONAL NAME
- Personal name:
Technical University of Catalunya
ADDED ENTRY--PERSONAL NAME
- Personal name:
Charles University
ADDED ENTRY--PERSONAL NAME
- Personal name:
Middle East Technical University
ADDED ENTRY--PERSONAL NAME
- Personal name:
Sabanci University
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T06
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u gre d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638285
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T07
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 270-733-242-642-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
gre
- Language code of text/sound track or separate title:
hun
- Language code of text/sound track or separate title:
ita
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ell
- Language code of text/sound track or separate title:
hun
- Language code of text/sound track or separate title:
ita
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2007 CoNLL Shared Task - Greek, Hungarian & Italian
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T07
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2007 CoNLL Shared Task - Greek, Hungarian & Italian consists of dependency treebanks
in three languages used as part of the CoNLL 2007 shared task on multilingual dependency
parsing and domain adaptation. The languages covered in this release are: Greek, Hungarian
and Italian. LDC also released the following 2006 & 2007 CoNLL Shared Task corpora:
* 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish (LDC2018T06) * 2007 CoNLL
Shared Task - Arabic & English (LDC2018T08) * 2006 CoNLL Shared Task - Ten Languages
(LDC2015T11) * 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
This corpus is cross listed and jointly released with ELRA as ELRA-W0122. The Conference
on Computational Natural Language Learning (CoNLL) is accompanied every year by a
shared task intended to promote natural language processing applications and evaluate
them in a standard setting. In 2006 and 2007, the shared tasks were devoted to the
parsing of syntactic dependencies using corpora from up to thirteen languages. The
task aimed to define and extend the then-current state of the art in dependency parsing,
a technology that complemented previous tasks by producing a different kind of syntactic
description of input text. The 2007 shared task added a domain adaptation track for
English in addition to the multilingual track. More information about the 2007 shared
task is available at the CoNLL Previous Tasks web site. LDC has released data sets
from other CoNLL shared tasks. 2008 CoNLL Shared Task Data (LDC2009T12) contains the
English material used in the 2008 shared task which focused on English, employed a
unified dependency-based formalism and merged the tasks of syntactic dependency parsing,
identifying semantic arguments and labeling them with semantic roles. 2009 CoNLL Shared
Task Data Parts 1 and 2 (LDC2012T03 and LDC2012T04) consists of the English, Catalan,
Chinese, Czech, German and Spanish resources used in the 2009 task which included
a comparison of time and space complexity based on participants' input and learning
curve comparison for languages with large datasets. 2015-2016 CoNLL Shared Task (LDC2017T13)
contains Chinese and English resources used in the 2015 and 2016 shared tasks on shallow discourse
parsing. *Data* The source data in the treebanks in this release consists principally
of various texts (e.g., textbooks, news, literature) annotated in dependency format.
In general, dependency grammar is based on the idea that the verb is the center of
the clause structure and that other units in the sentence are connected to the verb
as directed links or dependencies. This is a one-to-one correspondence: for every
element in the sentence there is one node in the sentence structure that corresponds
to that element. In constituency or phrase structure grammars, on the other hand,
clauses are divided into noun phrases and verb phrases and in each sentence, one or
more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example
of a constituency or phrase structure approach. All of the data sets in this release
are dependency treebanks. The individual data sets are: * Greek Dependency Treebank
(Greek) * The Szeged Treebank (SzTB) (Hungarian) * ISST-CoNLL (Italian)
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Modern Greek (1453-), Hungarian, and Italian. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dipartimento di Informatica of the University of Pisa
ADDED ENTRY--PERSONAL NAME
- Personal name:
Institute for Language and Speech Processing
ADDED ENTRY--PERSONAL NAME
- Personal name:
Institute of Informatics at the University of Szeged
ADDED ENTRY--PERSONAL NAME
- Personal name:
Institute of Linguistics at the Hungarian Academy of Sciences
ADDED ENTRY--PERSONAL NAME
- Personal name:
Morphologic Ltd.
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T07
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638293
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T08
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 505-782-255-628-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
2007 CoNLL Shared Task - Arabic & English
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T08
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
2007 CoNLL Shared Task - Arabic & English consists of dependency treebanks in two
languages used as part of the CoNLL 2007 shared task on multi-lingual dependency parsing
and domain adaptation. The languages covered in this release are Arabic and English.
LDC also released the following 2006 & 2007 CoNLL Shared Task corpora: * 2007 CoNLL
Shared Task - Greek, Hungarian & Italian (LDC2018T07) * 2007 CoNLL Shared Task - Basque,
Catalan, Czech & Turkish (LDC2018T06) * 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12) This
corpus is cross listed with ELRA as ELRA-W0123. The Conference on Computational Natural
Language Learning (CoNLL) is accompanied every year by a shared task intended to promote
natural language processing applications and evaluate them in a standard setting.
In 2006 and 2007, the shared tasks were devoted to the parsing of syntactic dependencies
using corpora from up to thirteen languages. The task aimed to define and extend the
then-current state of the art in dependency parsing, a technology that complemented
previous tasks by producing a different kind of syntactic description of input text.
The 2007 shared task added a domain adaptation track for English in addition to the
multilingual track. More information about the 2007 shared task is available at the
CoNLL Previous Tasks web site. LDC has released data sets from other CoNLL shared
tasks. 2008 CoNLL Shared Task Data (LDC2009T12) contains the English material used
in the 2008 shared task which focused on English, employed a unified dependency-based
formalism and merged the tasks of syntactic dependency parsing, identifying semantic
arguments and labeling them with semantic roles. 2009 CoNLL Shared Task Data Parts
1 and 2 (LDC2012T03 and LDC2012T04) consists of the English, Catalan, Chinese, Czech,
German and Spanish resources used in the 2009 task which included a comparison of
time and space complexity based on participants' input and learning curve comparison
for languages with large datasets. 2015-2016 CoNLL Shared Task (LDC2017T13) contains
Chinese and English resources used in the 2015 and 2016 shared tasks on dependency
parsing. *Data* The source data in the treebanks in this release consists principally
of various texts (e.g., textbooks, news, literature) annotated in dependency format.
In general, dependency grammar is based on the idea that the verb is the center of
the clause structure and that other units in the sentence are connected to the verb
as directed links or dependencies. This is a one-to-one correspondence: for every
element in the sentence there is one node in the sentence structure that corresponds
to that element. In constituency or phrase structure grammars, on the other hand,
clauses are divided into noun phrases and verb phrases and in each sentence, one or
more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example
of a constituency or phrase structure approach. All of the data sets in this release
are dependency treebanks. The individual data sets are: * Prague Arabic Dependency
Treebank (Arabic) * CHILDES (English) * PennBioIE Oncology 1.0 (English) * Treebank-3
(English)
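The one-to-one correspondence between sentence elements and nodes described above can be illustrated with a small sketch. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); this record does not state the file layout, so check the documentation accompanying the release. The sample sentence and the function names are invented for illustration and are not taken from the corpus.

```python
# Column names of the CoNLL-X tabular format (an assumption; verify against the release docs).
CONLL_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def parse_sentence(block):
    """Turn one sentence block (one token per line) into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        token = dict(zip(CONLL_COLUMNS, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the artificial root
        tokens.append(token)
    return tokens


def dependency_edges(tokens):
    """Return one (head_form, relation, dependent_form) triple per token."""
    forms = {t["id"]: t["form"] for t in tokens}
    forms[0] = "ROOT"
    return [(forms[t["head"]], t["deprel"], t["form"]) for t in tokens]


# Invented example sentence; labels such as SBJ and OBJ are illustrative only.
SAMPLE = "\n".join([
    "1\tShe\tshe\tPRP\tPRP\t_\t2\tSBJ\t_\t_",
    "2\treads\tread\tVBZ\tVBZ\t_\t0\tROOT\t_\t_",
    "3\tbooks\tbook\tNNS\tNNS\t_\t2\tOBJ\t_\t_",
])

tokens = parse_sentence(SAMPLE)
edges = dependency_edges(tokens)
for head, relation, dependent in edges:
    print(f"{head} --{relation}--> {dependent}")
```

Each token yields exactly one edge, which is the one-node-per-element property that separates a dependency treebank from a constituency treebank.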
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Standard Arabic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T08
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638315
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018S01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 112-363-425-685-7
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
DIRHA English WSJ Audio
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018S01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for
Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech
interaction with distant microphones in a domestic environment. It comprises
approximately 85 hours of real and simulated read speech by six native American English
speakers. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A),
specifically, the 5,000 word subset of read speech from Wall Street Journal news text.
This release contains signals of different characteristics in terms of noise and reverberation
making it suitable for various multi-microphone signal processing and distant speech
recognition tasks. The corpus can be coupled with related Kaldi baselines and tools
that are available online. *Data* Speech was collected in a real apartment setting with
typical domestic background noise and inter/intra-room reverberation effects. A total
of 32 microphones were placed in the living-room (26 microphones) and in the kitchen
(6 microphones). The original recordings were made at a sampling frequency of 48 kHz.
However, for the sake of compactness, the released signals in this publication are
in wav format with 16 kHz sampling frequency and 16 bit resolution. Annotations for
each acoustic sequence are included in xml format, such as microphone positions, speaker
id, speaker gender and speaker position. Additional metadata about the speakers and
images of the apartment setting are also provided. Consult the documentation accompanying
this release for more information about the collection.
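As a hedged illustration of the signal format described above, the following sketch uses Python's standard wave module to check whether a file has a 16 kHz sampling frequency and 16-bit resolution. The helper names are hypothetical, the mono channel count is an assumption, and the audio is a synthetic silent signal generated in memory, not data from the release.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000          # sampling frequency stated in the summary
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit resolution


def wav_properties(fileobj):
    """Read basic format properties from a WAV file or file-like object."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_release_format(props):
    """True when the file has the sampling rate and bit depth described above."""
    return (
        props["rate_hz"] == EXPECTED_RATE_HZ
        and props["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
    )


def make_silent_wav(rate_hz, seconds):
    """Build a synthetic mono 16-bit WAV in memory (stand-in for a corpus file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # assumption: one channel per microphone file
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    buffer.seek(0)
    return buffer


released = wav_properties(make_silent_wav(16000, 1.0))
original = wav_properties(make_silent_wav(48000, 1.0))
print(matches_release_format(released), matches_release_format(original))
```

A check of this kind can be run over the files after download to confirm they are the 16 kHz versions and not 48 kHz originals.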
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ravanelli, Mirco
ADDED ENTRY--PERSONAL NAME
- Personal name:
Cristoforetti, Luca
ADDED ENTRY--PERSONAL NAME
- Personal name:
Omologo, Maurizio
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018S01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u spa d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638323
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T01
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 752-423-916-829-4
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
spa
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
DEFT Spanish Treebank
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T01
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
DEFT Spanish Treebank was developed by the Linguistic Data Consortium (LDC) and the
Language and Computation Center (CLiC), University of Barcelona. It contains treebank
annotation of international Spanish newswire text and Latin American Spanish discussion
forum data created for the DARPA Deep Exploration and Filtering of Text (DEFT) program.
DEFT aimed to improve state-of-the-art capabilities in automated deep natural language
processing with a particular focus on technologies dealing with inference, causal
relationships and anomaly detection across several languages. DEFT Spanish Treebank
supported the program's goal of deep natural language understanding. *Data* Newswire
source files were selected from Spanish Gigaword Third Edition (LDC2011T12) and were
manually sentence-segmented for DEFT. Discussion forum source files were selected
from Spanish discussion forum source data collected by LDC, consisting of continuous
multi-posts of 100-1000 words. This release contains 114 files (54,394 tokens) of
newswire data and 60 files (55,307 tokens) of discussion forum data all of which were
annotated with constituents and syntactic functions. The annotation guidelines for
DEFT Spanish Treebank are included in the documentation accompanying this release.
Source documents are presented as plain text files with one sentence unit per line.
Treebank annotation files are in xml.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Spanish. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Taulé, Mariona
ADDED ENTRY--PERSONAL NAME
- Personal name:
Martí, Maria Antonia
ADDED ENTRY--PERSONAL NAME
- Personal name:
Garí, Aina
ADDED ENTRY--PERSONAL NAME
- Personal name:
Nofre, Montserrat
ADDED ENTRY--PERSONAL NAME
- Personal name:
Song, Zhiyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Joe
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T01
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u chi d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638307
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 713-266-631-883-0
LANGUAGE CODE
- Language code of text/sound track or separate title:
chi
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
cmn
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRAD Chinese-French Parallel Text -- Blog
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD
project. It contains French translations of a subset of approximately 10,000 Chinese
words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06). The PEA-TRAD project
(Translation as a Support for Document Analysis) was supported by the French Ministry
of Defense (DGA). Its purpose was to develop speech-to-speech translation technology
for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.
ELDA developed several corpora for this effort. LDC has also released TRAD Arabic-French
Parallel Text -- Newsgroup (LDC2018T13). *Data* This release consists of 444 segments
(translation units) from 17 documents. The source data is Chinese blog text collected
and translated into English by LDC for the DARPA GALE (Global Autonomous Language
Exploitation) program. Information about the ELDA translation team, translation guidelines
and validation results is contained in the documentation accompanying this release.
The Chinese source file contains 15,809 characters and the French reference translation
contains 11,769 words. The data is presented in two Unicode-encoded XML files along
with an associated DTD.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Mandarin Chinese and French. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u tpi d
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018S02
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 661-620-169-416-8
LANGUAGE CODE
- Language code of text/sound track or separate title:
tpi
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
tpi
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018S02
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains
approximately 200 hours of Tok Pisin conversational and scripted telephone speech
collected in 2013 along with corresponding transcripts. The Babel program focuses
on underserved languages and seeks to develop speech recognition technology that can
be rapidly applied to any human language to support keyword search performance over
large amounts of recorded speech. *Data* The Tok Pisin speech in this release represents
that spoken in the Papuan dialect region of Papua New Guinea. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to 65 years.
Calls were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and inside
a vehicle. Audio data is presented as 8 kHz 8-bit A-law encoded audio in SPHERE format
and 48 kHz 24-bit PCM encoded audio in WAV format. Transcripts are encoded in UTF-8.
Further information about transcription methodology is contained in the documentation
accompanying this release. Evaluation data is available from NIST in support of OpenKWS.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Tok Pisin. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Bills, Aric
ADDED ENTRY--PERSONAL NAME
- Personal name:
Conners, Thomas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Corris, Miriam
ADDED ENTRY--PERSONAL NAME
- Personal name:
Dubinski, Eyal
ADDED ENTRY--PERSONAL NAME
- Personal name:
Fiscus, Jonathan G.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harper, Mary
ADDED ENTRY--PERSONAL NAME
- Personal name:
Heighway, Melanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Kozlov, Kirill
ADDED ENTRY--PERSONAL NAME
- Personal name:
Malyska, Nicolas
ADDED ENTRY--PERSONAL NAME
- Personal name:
Melot, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ray, Jessica
ADDED ENTRY--PERSONAL NAME
- Personal name:
Rytting, Anton
ADDED ENTRY--PERSONAL NAME
- Personal name:
Shen, Wade
ADDED ENTRY--PERSONAL NAME
- Personal name:
Silber, Ronnie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tzoukermann, Evelyne
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018S02
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u per d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638331
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018S03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 663-913-048-272-3
LANGUAGE CODE
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
per
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
prs
- Language code of text/sound track or separate title:
fas
- Language code of text/sound track or separate title:
pus
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Multi-Language Conversational Telephone Speech 2011 -- Central Asian
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
spoken word
- Content type code:
spw
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018S03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Multi-Language Conversational Telephone Speech 2011 -- Central Asian was developed
by the Linguistic Data Consortium (LDC) and comprises approximately 37 hours
of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi
and Pashto. The data were collected primarily to support research and technology evaluation
in automatic language identification, and portions of these telephone calls were used
in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language
pair discrimination for 24 languages/dialects, some of which could be considered mutually
intelligible or closely related. LDC has also released the following as part of the
Multi-Language Conversational Telephone Speech 2011 series: * Slavic Group (LDC2016S11)
* Turkish (LDC2017S09) * South Asian (LDC2017S14) *Data* Participants were recruited
by native speakers who contacted acquaintances in their social network. Those native
speakers made one call, up to 15 minutes, to each acquaintance. The data were collected
using LDC's telephone collection infrastructure, which consisted of three computer telephony
systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic
information about the participants was not collected. All audio data are presented
in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file
is 2 channels, recorded at 8000 samples/second with samples stored as 16-bit signed
integers, representing a lossless conversion from the original mu-law sample data
as captured digitally from the public telephone network. The following table summarizes
the total number of calls, total number of hours of recorded audio, and the total
size of compressed data:

group    lng     #calls  #hours   #MB
c_asian  fas        100    19.7   900
c_asian  prs         17     3.2   175
c_asian  pus         79    14.5   709
c_asian  Totals     196    37.4  1784
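The figures in the table can be cross-checked with a short sketch. It recomputes the totals from the per-language rows and estimates the uncompressed audio size from the stated format (2 channels, 8000 samples/second, 16-bit samples). It assumes that the hours column counts call duration with both channels recorded in parallel, and it ignores file header overhead; the function and variable names are illustrative only.

```python
# Per-language rows from the table: (calls, hours, compressed size in MB).
CALL_STATS = {
    "fas": (100, 19.7, 900),
    "prs": (17, 3.2, 175),
    "pus": (79, 14.5, 709),
}

total_calls = sum(calls for calls, _, _ in CALL_STATS.values())
total_hours = round(sum(hours for _, hours, _ in CALL_STATS.values()), 1)
total_mb = sum(mb for _, _, mb in CALL_STATS.values())


def uncompressed_bytes(hours, sample_rate_hz=8000, channels=2, bytes_per_sample=2):
    """Estimate raw PCM size for the given duration (header overhead ignored)."""
    return int(round(hours * 3600 * sample_rate_hz * channels * bytes_per_sample))


raw_bytes = uncompressed_bytes(total_hours)
print(total_calls, total_hours, total_mb)
print(f"estimated uncompressed size: {raw_bytes / 1e9:.2f} GB")
```

Under these assumptions the recomputed totals agree with the Totals row (196 calls, 37.4 hours, 1784 MB), and the uncompressed audio would be a little over 4.3 GB.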
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Dari, Persian, and Pushto. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Sound recordings
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Jones, Karen
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Walker, Kevin
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018S03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638366
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T03
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 413-253-098-648-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TAC KBP Comprehensive English Source Corpora 2009-2014
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T03
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by the Linguistic
Data Consortium (LDC) and contains the 3,877,207 English source documents used in
support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series
of workshops organized by the National Institute of Standards and Technology (NIST).
TAC was developed to encourage research in natural language processing and related
applications by providing a large test collection, common evaluation procedures, and
a forum for researchers to share their results. Through its various evaluations, the
Knowledge Base Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those appearing in a knowledge
base and extract novel information about entities from a document collection and add
it to a new or existing knowledge base. *Data* The source data consists of newswire,
broadcast material, and web text collected by LDC. Documents are released as a collection
of zip files for overall compactness, and ease and efficiency of use. When unpacked
the documents are all UTF-8 text files with a basic markup structure. Also provided
are a series of lists and tables to aid in specific zip file to doc mappings and the
recreation of specific test sets. See the included documentation for more information.
*Acknowledgement* This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or implied, of
Air Force Research Laboratory and Defense Advanced Research Projects Agency or the
U.S. Government.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ellis, Joe
ADDED ENTRY--PERSONAL NAME
- Personal name:
Getman, Jeremy
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T03
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u amh d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638358
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T04
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 494-818-599-171-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
amh
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
computer program
- Content type code:
cop
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T04
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed
by the Linguistic Data Consortium and consists of approximately 25 million words
of monolingual Amharic text, approximately 600,000 of which are translated into English.
Another 80,000 words are also translated from English into Amharic. The LORELEI (Low
Resource Languages for Emergent Incidents) Program is concerned with building Human
Language Technology for low resource languages in the context of emergent situations
like natural disasters or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs and Incident Language Packs for over two dozen low resource
languages, comprising data, annotations, basic natural language processing tools,
lexicons and grammatical resources. Representative languages are selected to provide
broad typological coverage, while incident languages are selected to evaluate system
performance on a language whose identity is disclosed at the start of the evaluation.
*Data* Data was collected in the following genres: discussion forums, news, reference,
social network and weblog. Both monolingual text collection and parallel text creation
involved a combination of manual and automatic methods, which are detailed in the
included documentation. All harvested content was initially converted from its original
HTML form into a relatively uniform XML format. XML data is presented in two formats:
a "homogenized" XML format that preserves the minimum set of tags needed to represent
the structure of the relevant text as seen by the human web-page reader and a fully
segmented and tokenized version of the text. All text data is encoded as UTF-8. Also
included in this release are two tools: one to recreate original source data from
the processed XML material and the other to condition text data users download from
Twitter. *Acknowledgement* This material is based upon work supported by the Defense
Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any
opinions, findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Amharic and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
computer program
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T04
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638374
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T09
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 907-641-112-060-3
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
SPADE (Syntactic Phrase Alignment Dataset for Evaluation)
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T09
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse
trees and alignment on English sentential paraphrases extracted from machine translation
evaluation corpora and separated into development and test sets. Reference translations
from machine translation evaluation corpora were used as sentential paraphrases. They
were sourced from the following data sets released by LDC from the NIST (National
Institute of Standards and Technology) open machine translation evaluation series
(OpenMT): LDC2010T14, LDC2010T17, LDC2010T21, and LDC2013T03. *Data* Reference translations
of 10 to 30 words were randomly extracted for annotation from NIST OpenMT corpora.
Gold standard annotations of HPSG (head-driven phrase structure grammar) trees and
phrase alignments were performed, resulting in 20,276 phrases extracted from 201 sentential
paraphrases and 15,721 paraphrase alignments. Further information about the annotation
process is contained in the documentation accompanying this release. All annotation
data is presented as UTF-8 XML.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Arase, Yuki
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tsujii, Junichi
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T09
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u som d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638382
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T11
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 358-095-625-105-7
LANGUAGE CODE
- Language code of text/sound track or separate title:
som
- Language code of text/sound track or separate title:
eng
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
CONTENT TYPE
- Content type code:
computer program
- Content type code:
cop
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T11
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed
by the Linguistic Data Consortium (LDC) and consists of approximately 13 million
words of monolingual Somali text, approximately 800,000 of which are translated into
English. Another 100,000 words are also translated from English into Somali. The LORELEI
(Low Resource Languages for Emergent Incidents) Program is concerned with building
Human Language Technology for low resource languages in the context of emergent situations
like natural disasters or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs and Incident Language Packs for over two dozen low resource
languages, comprising data, annotations, basic natural language processing tools,
lexicons and grammatical resources. Representative languages are selected to provide
broad typological coverage, while incident languages are selected to evaluate system
performance on a language whose identity is disclosed at the start of the evaluation.
*Data* Data was collected in the following genres: discussion forums, news, reference,
social network and weblog. Both monolingual text collection and parallel text creation
involved a combination of manual and automatic methods, which are detailed in the
included documentation. All harvested content was initially converted from its original
HTML form into a relatively uniform XML format. XML data is presented in two formats:
a "homogenized" XML format that preserves the minimum set of tags needed to represent
the structure of the relevant text as seen by the human web-page reader and a fully
segmented and tokenized version of the text. All text data is encoded as UTF-8. Also
included in this release are two tools: one to recreate original source data from
the processed XML material and the other to condition text data users download from
Twitter. *Acknowledgement* This material is based upon work supported by the Defense
Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any
opinions, findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of DARPA.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Somali and English. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
computer program
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Graff, David
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ma, Xiaoyi
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wright, Jonathan
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T11
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638390
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T10
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 663-919-074-680-5
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
arz
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
BOLT Arabic Discussion Forums
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T10
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
BOLT Arabic Discussion Forums was developed by the Linguistic Data Consortium (LDC)
and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from
the Internet using a combination of manual and automatic processes. The DARPA BOLT
(Broad Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on user-generated
content. LDC supported the BOLT program by collecting informal data sources -- discussion
forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference. The material in this release represents the unannotated
Arabic source data in the discussion forum genre. *Data* Collection was seeded based
on the results of manual data scouting by native speaker annotators. Scouts were instructed
to seek content in Egyptian Arabic that was original, interactive and informal. Upon
locating an appropriate thread, scouts submitted the URL and some simple judgments
about it to a database, via a web browser plug-in. When multiple threads from a forum
were submitted, the entire forum was automatically harvested and added to the collection.
The scale of the collection precluded manual review of all data. Only a small portion
of the threads included in this release were manually reviewed, and it is expected
that there may be some offensive or otherwise undesired content as well as some threads
that contain a large amount of non-Arabic content. Language identification was performed
on all threads in this corpus (using CLD2), and threads for which the results indicate
a high probability of largely non-Arabic content are listed in arz_suspect_LID.txt
in the docs directory of this package. It should also be noted that many threads may
contain a mixture of Egyptian and other varieties of Arabic, even among the threads
that are primarily Arabic. The corpus consists of zipped HTML and XML files. Each
HTML file is the raw HTML downloaded from a discussion thread. If a thread
spanned multiple URLs, it was stored as a concatenation of the downloaded HTML files.
The XML files were converted from the raw HTML. *Acknowledgement* This material is
based upon work supported by the Defense Advanced Research Projects Agency (DARPA)
under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the
position or the policy of the Government, and no official endorsement should be inferred.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Egyptian Arabic. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Tracey, Jennifer
ADDED ENTRY--PERSONAL NAME
- Personal name:
Lee, Haejoong
ADDED ENTRY--PERSONAL NAME
- Personal name:
Strassel, Stephanie
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ismael, Safa
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T10
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u eng d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638404
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T12
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 504-151-596-424-6
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
Concretely Annotated New York Times
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T12
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
Concretely Annotated New York Times was developed by Johns Hopkins University's Human
Language Technology Center of Excellence. It adds multiple kinds and instances of
automatically-generated syntactic, semantic and coreference annotations to The New
York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured,
hierarchical and overlapping linguistic annotations. This release provides multiple
tool outputs producing the same annotation types as different annotation theories
under a shared tokenization. *Data* Concretely Annotated New York Times contains all
of the 1.8 million articles in The New York Times Annotated Corpus. Those articles
were written and published by the New York Times between January 1, 1987 and June
19, 2007; the 2008 corpus also includes metadata provided by the New York Times Newsroom,
the New York Times Indexing Service and the online production staff at nytimes.com.
The following layers of annotation were added by processing the articles under the
Concrete schema: * Segmented sentences and Penn Treebank-style tokenized words * Treebank-style
constituent parse trees * Four different syntactic dependency trees * Named entities
* Part of speech tags * Lemmas * In-document entity coreference chains * Three different
frame semantic parses. See analytics.pdf for the list of tools used to create those
annotations. The data is stored in a binary form called Concrete, which is based on
Apache Thrift. Concrete can be read and written in many common programming languages,
such as Java, Python, JavaScript and C++. Concrete also includes a number of utilities
to access and view the data in human-readable forms. The original NITF (News Industry
Text Format) document structure in The New York Times Annotated Corpus was preserved
in this Concrete version.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content and documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Ferraro, Francis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Thomas, Max
ADDED ENTRY--PERSONAL NAME
- Personal name:
Wolfe, Travis
ADDED ENTRY--PERSONAL NAME
- Personal name:
Gormley, Matthew R.
ADDED ENTRY--PERSONAL NAME
- Personal name:
Harman, Craig
ADDED ENTRY--PERSONAL NAME
- Personal name:
Van Durme, Benjamin
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T12
LEADER
- Record Status:
n
- Type of record:
m
- Bibliographic level:
m
- Type of control:
- Undefined:
a
- Encoding level:
3
- Descriptive cataloging form:
i
- Linked record requirement:
m o u
cu ||||||u||||
s2018 pau u ara d
INTERNATIONAL STANDARD BOOK NUMBER
- International Standard Book Number:
1585638412
OTHER STANDARD IDENTIFIER
- Standard recording code:
LDC2018T13
OTHER STANDARD IDENTIFIER
- Standard recording code:
ISLRN: 582-339-053-329-9
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
fre
- Language code of summary or abstract/overprinted title or subtitle:
eng
LANGUAGE CODE
- Language code of text/sound track or separate title:
ara
- Language code of text/sound track or separate title:
arb
- Language code of text/sound track or separate title:
fra
- Language code of summary or abstract/overprinted title or subtitle:
eng
AUTHENTICATION CODE
TITLE STATEMENT
- Title:
TRAD Arabic-French Parallel Text -- Newsgroup
PRODUCTION, PUBLICATION, DISTRIBUTION, MANUFACTURE, AND COPYRIGHT NOTICE
- Place of production, publication, distribution, manufacture:
[Philadelphia, Pennsylvania]:
- Name of producer, publisher, distributor, manufacturer:
Linguistic Data Consortium,
- Date of production, publication, distribution, manufacture, or copyright notice:
[2018]
CONTENT TYPE
- Content type code:
computer dataset
- Content type code:
cod
- Source:
rdacontent
CONTENT TYPE
- Content type code:
text
- Content type code:
txt
- Source:
rdacontent
MEDIA TYPE
- Media type term:
computer
- Media type code:
c
- Source:
rdamedia
CARRIER TYPE
- Carrier type term:
unspecified
- Carrier type code:
zu
- Source:
rdacarrier
GENERAL NOTE
GENERAL NOTE
- General note:
https://catalog.ldc.upenn.edu/LDC2018T13
RESTRICTIONS ON ACCESS NOTE
- Terms governing access:
Available to University of Alberta users only.
SUMMARY, ETC.
- Summary, etc.:
TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the
PEA-TRAD project. It contains French translations of a subset of approximately 10,000
Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03).
The PEA-TRAD project (Translation as a Support for Document Analysis) was supported
by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech
translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from
a variety of domains. ELDA developed several corpora for this effort. LDC has also
released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02). *Data* This release
consists of 398 segments (translation units) from 17 documents. The source data is
Arabic newsgroup text collected and translated into English by the Linguistic Data
Consortium for the DARPA GALE (Global Autonomous Language Exploitation) program. Information
about the ELDA translation team, translation guidelines and validation results is
contained in the documentation accompanying this release. The Arabic source file contains
10,706 words and the French reference translation contains 15,843 words. The data
is presented in two Unicode-encoded XML files along with an associated DTD.
SUMMARY, ETC.
- Summary, etc.:
Data samples are available on the LDC website.
LANGUAGE NOTE
- Language note:
Content in Arabic, Standard Arabic, and French. Documentation in English.
INDEX TERM--GENRE/FORM/PHYSICAL CHARACTERISTICS
- Genre/form/physical characteristics:
Excerpts
- Source of term:
lcgft
ADDED ENTRY--PERSONAL NAME
- Personal name:
Linguistic Data Consortium
ADDED ENTRY--PERSONAL NAME
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
University of Alberta Access (Request Form)
- Uniform Resource Identifier:
https://docs.google.com/forms/d/e/1FAIpQLSd4VsEYOWoubQww-01W7IV2qDaAr4ctBJUhrJvfyN0GwoMuFQ/viewform
ELECTRONIC LOCATION AND ACCESS
- Materials specified:
Dataset documentation
- Uniform Resource Identifier:
https://catalog.ldc.upenn.edu/LDC2018T13